1 R/RStudio Interface

R is a dialect of the S language. It is a case-sensitive, interpreted language. You can enter commands one at a time at the command prompt (>) or run a set of commands from a source file. There is a wide variety of data types, including vectors (numerical, character, logical), matrices, data frames, and lists. Most functionality is provided through built-in and user-created functions and all data objects are kept in memory during an interactive session. Basic functions are available by default. Other functions are contained in packages that can be attached to a current session as needed.

R is a case-sensitive language: FOO, Foo, and foo are three different objects!

This section describes working with the R interface. A key skill to using R effectively is learning how to use the built-in help system. Other sections describe the working environment, inputting programs and outputting results, installing new functionality through packages, GUIs that have been developed for R, customizing the environment, producing high-quality output, and running programs in batch. A fundamental design feature of R is that the output from most functions can be used as input to other functions. This is described in reusing results.

The Workspace

The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions). At the end of an R session, the user can save an image of the current workspace that is automatically reloaded the next time R is started. Commands are entered interactively at the R user prompt. Up and down arrow keys scroll through your command history.

You will probably want to keep different projects in different physical directories. Here are some standard commands for managing your workspace.

getwd() # print the current working directory - cwd 
## [1] "/Users/bherault/Library/Mobile Documents/com~apple~CloudDocs/Bruno/Teaching/INPHB/ED"
ls()    # list the objects in the current workspace 
## character(0)
#setwd(mydirectory)         change to mydirectory 
#setwd("c:/docs/mydir")     note / instead of \ in windows 
#setwd("/usr/rob/mydir")    on linux 

view and set options for the session

help(options) # learn about available options 
options(digits=3) # number of digits to print on output 

work with your previous commands

# history() # display last 25 commands 
# history(max.show=Inf) # display all previous commands 

save your command history

# savehistory(file="myfile") # default is ".Rhistory" 

recall your command history

#loadhistory(file="myfile") # default is ".Rhistory" 

load a workspace into the current session. If you don't specify a path, the cwd is assumed

#load("myfile.RData") 
#q() # quit R. You will be prompted to save the workspace. 

Important Note to Windows Users: R gets confused if you use a path in your code like:

# c:\mydocuments\myfile.txt 

This is because R treats the backslash ("\") as an escape character. Instead, use:

# c:\\mydocuments\\myfile.txt 
# c:/mydocuments/myfile.txt 

Either will work.

1.1 Input/Output

By default, launching R starts an interactive session with input from the keyboard and output to the screen. However, you can have input come from a script file (a file containing R commands) and direct output to a variety of destinations.

1.1.1 Input

The source() function runs a script in the current session. If the filename does not include a path, the file is taken from the current working directory.

input a script

# source("myfile") 

1.1.2 Output

The sink() function redirects output to a file; calling sink() again with no arguments restores output to the screen.

#sink() 
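
For example, output can be diverted to a file and then restored to the screen (the file name here is illustrative):

```r
sink("analysis_output.txt")  # divert printed output to a file (illustrative name)
summary(1:10)                # this summary is written to the file, not the screen
sink()                       # restore output to the screen
```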

1.2 Packages

Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.

.libPaths() # get library location 
## [1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library"
library()  # see all packages installed 
search()    # see packages currently loaded 
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

1.2.1 Adding Packages

You can expand the types of analyses you do by adding other packages. A complete list of contributed packages is available from CRAN.

Follow these steps: Download and install a package (you only need to do this once). To use the package, invoke the library(package) command to load it into the current session. (You need to do this once in each session, unless you customize your environment to automatically load it each time.)

On MS Windows: choose Install Packages from the Packages menu, select a CRAN mirror (e.g., Norway), select a package (e.g., boot), then use the library(package) function to load it for use.

On Linux: download the package of interest as a compressed file. At the command prompt, install it using R CMD INSTALL [options] [-l lib] pkgs. Then use the library(package) function within R to load it for use in the session.
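
On any platform, the same two steps can also be run from the R prompt (using the boot package from the text as the example):

```r
install.packages("boot")  # download and install (only needed once)
library(boot)             # load into the current session (needed each session)
```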

2 Data Input

Unlike SAS, which has DATA and PROC steps, R has data structures (vectors, matrices, arrays, data frames) that you can operate on through functions that perform statistical analyses and create graphs. In this way, R is similar to PROC IML.

This section describes how to enter or import data into R, and how to prepare it for use in statistical analyses. Topics include R data structures, importing data (from Excel, SPSS, SAS, Stata, and ASCII Text Files), entering data from the keyboard, creating an interface with a database management system, exporting data (to Excel, SPSS, SAS, Stata, and Tab Delimited Text Files), annotating data (with variable labels and value labels), and listing data. In addition, methods for handling missing values and date values are presented.

2.1 Data types

R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.

Vectors

a <- c(1,2,5.3,6,-2,4) # numeric vector 
a 
## [1]  1.0  2.0  5.3  6.0 -2.0  4.0
b <- c("one","two","three") # character vector 
b 
## [1] "one"   "two"   "three"
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector (note: this masks the built-in c() function)
c 
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Refer to elements of a vector using subscripts.

a[c(2,4)] # 2nd and 4th elements of vector 
## [1] 2 6

2.1.1 Matrices

All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. The general format is

mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))

byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the matrix should be filled by columns (the default). dimnames provides optional labels for the columns and rows.

generates 5 x 4 numeric matrix

y<-matrix(1:20, nrow=5,ncol=4) 
y 
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
z<-matrix(1:20, nrow=5,ncol=4, byrow=T) 
z 
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16
## [5,]   17   18   19   20

another example

cells <- c(1,26,24,68) 
rnames <- c("R1", "R2") 
cnames <- c("C1", "C2") 
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames, cnames))
mymatrix 
##    C1 C2
## R1  1 26
## R2 24 68

Identify rows, columns, or elements using subscripts.

z[,4] # 4th column of matrix Z 
## [1]  4  8 12 16 20
z[3,] # 3rd row of matrix Z 
## [1]  9 10 11 12
z[2:4,1:3] # rows 2,3,4 of columns 1,2,3 
##      [,1] [,2] [,3]
## [1,]    5    6    7
## [2,]    9   10   11
## [3,]   13   14   15

2.1.2 Arrays

Arrays are similar to matrices but can have more than two dimensions. See help(array) for details.
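
A short sketch of a three-dimensional array:

```r
myarray <- array(1:12, dim=c(2,3,2)) # 2 rows, 3 columns, 2 "layers"
myarray[1,2,2] # row 1, column 2 of the 2nd layer
## [1] 9
```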

2.1.3 Data Frames

A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).

d <- c(1,2,3,4) 
e <- c("red", "white", "red", NA) 
f <- c(TRUE,TRUE,TRUE,FALSE) 
mydata <- data.frame(d,e,f) 
names(mydata) <-c("ID","Color","Passed") # variable names 
mydata 
str(mydata) 
## 'data.frame':    4 obs. of  3 variables:
##  $ ID    : num  1 2 3 4
##  $ Color : Factor w/ 2 levels "red","white": 1 2 1 NA
##  $ Passed: logi  TRUE TRUE TRUE FALSE

There are a variety of ways to identify the elements of a data frame.

mydata[2:3] # columns 2,3 of data frame 
mydata[c("ID","Passed")] # columns ID and Passed from data frame 
mydata$Passed # variable Passed in the data frame
## [1]  TRUE  TRUE  TRUE FALSE

2.1.4 Lists

An ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name.

example of a list with 4 components - a string, a numeric vector, a matrix, and a scalar

w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3) 
w
## $name
## [1] "Fred"
## 
## $mynumbers
## [1]  1.0  2.0  5.3  6.0 -2.0  4.0
## 
## $mymatrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
## 
## $age
## [1] 5.3

example of a list containing two lists

v <- c(w,w) 
v 
## $name
## [1] "Fred"
## 
## $mynumbers
## [1]  1.0  2.0  5.3  6.0 -2.0  4.0
## 
## $mymatrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
## 
## $age
## [1] 5.3
## 
## $name
## [1] "Fred"
## 
## $mynumbers
## [1]  1.0  2.0  5.3  6.0 -2.0  4.0
## 
## $mymatrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
## 
## $age
## [1] 5.3

Identify elements of a list using the [[]] convention.

w[[3]] # 3rd component of the list 
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20

2.1.5 Factors

Tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a vector of integers in the range 1…k (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.

variable gender with 20 “male” entries and 30 “female” entries

gender <- c(rep("male",20), rep("female", 30)) 
gender <- factor(gender) 
gender 
##  [1] male   male   male   male   male   male   male   male   male   male  
## [11] male   male   male   male   male   male   male   male   male   male  
## [21] female female female female female female female female female female
## [31] female female female female female female female female female female
## [41] female female female female female female female female female female
## Levels: female male

stores gender as 20 2s and 30 1s and associates 1=female, 2=male internally (alphabetical order). R now treats gender as a nominal variable

summary(gender) 
## female   male 
##     30     20

An ordered factor is used to represent an ordinal variable. For a variable rating coded as "large", "medium", "small", rating <- ordered(rating) recodes rating to 1, 2, 3 and associates 1=large, 2=medium, 3=small internally (alphabetical order). R now treats rating as ordinal.

R will treat factors as nominal variables and ordered factors as ordinal variables in statistical procedures and graphical analyses. You can use options in the factor() and ordered() functions to control the mapping of integers to strings (overriding the alphabetical ordering). You can also use factors to create value labels.
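
For example, the ordering of levels can be set explicitly rather than alphabetically:

```r
rating <- c("medium", "small", "large", "medium")
rating <- ordered(rating, levels=c("small", "medium", "large"))
rating
## [1] medium small  large  medium
## Levels: small < medium < large
```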

2.1.6 Useful Functions

length(w) # number of elements or components 
## [1] 4
str(w) # structure of an object 
## List of 4
##  $ name     : chr "Fred"
##  $ mynumbers: num [1:6] 1 2 5.3 6 -2 4
##  $ mymatrix : int [1:5, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
##  $ age      : num 5.3
class(w)  # class or type of an object 
## [1] "list"
names(w)  #names 
## [1] "name"      "mynumbers" "mymatrix"  "age"
c(a,a)       # combine objects into a vector 
##  [1]  1.0  2.0  5.3  6.0 -2.0  4.0  1.0  2.0  5.3  6.0 -2.0  4.0
cbind(a, a) # combine objects as columns 
##         a    a
## [1,]  1.0  1.0
## [2,]  2.0  2.0
## [3,]  5.3  5.3
## [4,]  6.0  6.0
## [5,] -2.0 -2.0
## [6,]  4.0  4.0
rbind(a, a) # combine objects as rows 
##   [,1] [,2] [,3] [,4] [,5] [,6]
## a    1    2  5.3    6   -2    4
## a    1    2  5.3    6   -2    4
a     # prints the object a 
## [1]  1.0  2.0  5.3  6.0 -2.0  4.0
ls()  # list current objects 
##  [1] "a"        "b"        "c"        "cells"    "cnames"   "d"       
##  [7] "e"        "f"        "gender"   "mydata"   "mymatrix" "rnames"  
## [13] "v"        "w"        "y"        "z"
rm(v) # delete an object 
ls() # without v 
##  [1] "a"        "b"        "c"        "cells"    "cnames"   "d"       
##  [7] "e"        "f"        "gender"   "mydata"   "mymatrix" "rnames"  
## [13] "w"        "y"        "z"

2.2 Importing data

Importing data into R is fairly simple. For Stata and Systat, use the foreign package. For SPSS and SAS, I would recommend the Hmisc package for ease and functionality. Examples of importing data are provided below.

2.2.1 From A Comma Delimited Text File

# first row contains variable names, comma is separator 
# assign the variable id to row names
# note the / instead of \ on mswindows systems 
#mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")

(To practice importing a csv file, try importing the Paracou dataset using read.csv2().)

paracou <- read.csv2("paracou2015.csv")
head(paracou)
str(paracou)
## 'data.frame':    19 obs. of  20 variables:
##  $ NomForet     : Factor w/ 1 level "Paracou": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Libelle      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ NumCarre     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ NumArbre     : int  2974 2973 2972 2971 2970 2969 2968 2967 2966 2965 ...
##  $ X            : num  27.5 15 62 34.5 65 ...
##  $ Y            : num  190 188 182 184 201 ...
##  $ Xutm         : num  285114 285102 285148 285122 285146 ...
##  $ Yutm         : num  583051 583046 583052 583047 583071 ...
##  $ Annee        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Circonference: num  36 34 32 38 32.5 33.5 32.5 34.5 33 32 ...
##  $ MesuMort     : Factor w/ 1 level "VRAI": 1 1 1 1 1 1 1 1 1 1 ...
##  $ idCodeMesure : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Echelle      : Factor w/ 1 level "FAUX": 1 1 1 1 1 1 1 1 1 1 ...
##  $ idVern       : int  751 606 712 623 629 802 105 403 616 404 ...
##  $ NomVern      : Factor w/ 16 levels "baaka koko","bakuman",..: 12 13 9 5 2 16 4 10 6 11 ...
##  $ idTaxon      : int  33679 33679 33679 33679 33679 33679 33679 33679 33679 33679 ...
##  $ Famille      : Factor w/ 1 level "Indet.": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Genre        : Factor w/ 1 level "Indet.Indet.": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Espece       : Factor w/ 1 level "Indet.": 1 1 1 1 1 1 1 1 1 1 ...
##  $ idArbre      : int  249619 249618 249617 249616 249615 249614 249613 249612 249611 249610 ...

2.2.2 From Excel

One of the best ways to read an Excel file is to export it to a comma-delimited file and import it using the method above. Alternatively, you can use the xlsx package to access Excel files directly. The first row should contain variable/column names.

Read in the first worksheet from the workbook myexcel.xlsx first row contains variable names

#library(xlsx)
#mydata <- read.xlsx("c:/myexcel.xlsx", 1)

Read in the worksheet named mysheet

#mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "mysheet")

2.3 Keyboard Inputs

Usually you will obtain a data frame by importing it from SAS, SPSS, Excel, Stata, a database, or an ASCII file. To create it interactively, you can do something like the following.

2.3.1 create a data frame from scratch

age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220) 
mydata <- data.frame(age,gender,weight)
mydata

You can also use R’s built in spreadsheet to enter the data interactively, as in the following example.

2.3.2 enter data using editor

#mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0))
#mydata <- edit(mydata)

note that without the assignment in the line above, the edits are not saved!

2.4 Exporting data

There are numerous methods for exporting R objects into other formats. For SPSS, SAS, and Stata, you will need to load the foreign package. For Excel, you will need the xlsReadWrite package.

2.4.1 To A Tab Delimited Text File

write.table(mydata, "mydata", sep="\t")
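
Similarly, write.csv() exports to a comma-delimited file:

```r
write.csv(mydata, "mydata.csv", row.names=FALSE)
```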

2.4.2 To an Excel Spreadsheet

# library(xlsx)
# write.xlsx(mydata, "mydata")

Be careful with rJava (the 32- or 64-bit version must match your R installation)

2.5 Viewing Data

There are a number of functions for listing the contents of an object or dataset.

List objects in the working environment

ls()
##  [1] "a"        "age"      "b"        "c"        "cells"    "cnames"  
##  [7] "d"        "e"        "f"        "gender"   "mydata"   "mymatrix"
## [13] "paracou"  "rnames"   "w"        "weight"   "y"        "z"

List the variables in mydata

names(mydata)
## [1] "age"    "gender" "weight"

List the structure of mydata

str(mydata)
## 'data.frame':    3 obs. of  3 variables:
##  $ age   : num  25 30 56
##  $ gender: Factor w/ 2 levels "female","male": 2 1 2
##  $ weight: num  160 110 220

List levels of factor gender in mydata

levels(mydata$gender)
## [1] "female" "male"

Dimensions of an object

dim(mydata)
## [1] 3 3

Class of an object (numeric, matrix, data frame, etc)

class(mydata)
## [1] "data.frame"

Print mydata

mydata

Print first 10 rows of paracou

head(paracou, n=10)

Print last 5 rows of paracou

tail(paracou, n=5)

2.6 Missing data

In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data.

2.6.1 Testing for Missing Values

is.na(2) # returns FALSE; 2 is not missing
## [1] FALSE
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)
## [1] FALSE FALSE FALSE  TRUE

2.6.2 Recoding Values to Missing

-> recode 99 to missing for variable v1
-> select rows where v1 is 99 and recode column v1

#mydata$v1[mydata$v1==99] <- NA

2.6.3 Excluding Missing Values from Analyses

Arithmetic functions on missing values yield missing values.

x <- c(1,2,NA,3)
mean(x) # returns NA
## [1] NA
mean(x, na.rm=TRUE) # returns 2
## [1] 2

The function complete.cases() returns a logical vector indicating which cases are complete.

-> List rows of data that have missing values

mydata[!complete.cases(mydata),]

The function na.omit() returns the object with listwise deletion of missing values.

-> Create new dataset without missing data

x
## [1]  1  2 NA  3
newdata <- na.omit(x)
newdata
## [1] 1 2 3
## attr(,"na.action")
## [1] 3
## attr(,"class")
## [1] "omit"
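
na.omit() works on data frames as well, dropping every row that contains at least one NA (a small illustrative frame):

```r
df <- data.frame(v1=c(1, 2, NA), v2=c("a", NA, "c"))
na.omit(df) # keeps only row 1, the only complete case
```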

2.7 Date values

Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates.

Use as.Date( ) to convert strings to dates

mydates <- as.Date(c("2007-06-22", "2004-02-13"))
mydates
## [1] "2007-06-22" "2004-02-13"
class(mydates)
## [1] "Date"
# number of days between 6/22/07 and 2/13/04 
days <- mydates[1] - mydates[2]
days
## Time difference of 1225 days

Sys.Date( ) returns today’s date. date() returns the current date and time.

Sys.Date( )
## [1] "2018-01-15"
date()
## [1] "Mon Jan 15 10:40:20 2018"

Convert date info in format ‘mm/dd/yyyy’

strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
dates
## [1] "1965-01-05" "1975-08-16"

The default format is yyyy-mm-dd

mydates <- as.Date(c("2007-06-22", "2004-02-13"))

Date to Character: you can convert dates to character data using the as.character() function.

Convert dates to character data

strDates <- as.character(dates)
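
The format() function gives finer control when converting a Date to text:

```r
d <- as.Date("2007-06-22")
format(d, format="%d/%m/%Y") # day/month/year layout
## [1] "22/06/2007"
format(d, format="%B %d, %Y") # full month name (locale-dependent)
```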

3 Data Management

Once you have access to your data, you will want to massage it into useful form. This includes creating new variables (including recoding and renaming existing variables), sorting and merging datasets, aggregating data, reshaping data, and subsetting datasets (including selecting observations that meet criteria, randomly sampling observations, and dropping or keeping variables).

Each of these activities usually involves the use of R's built-in operators (arithmetic and logical) and functions (numeric, character, and statistical). Additionally, you may need to use control structures (if-then, for, while, switch) in your programs and/or create your own functions. Finally, you may need to convert variables or datasets from one type to another (e.g. numeric to character or matrix to data frame).

This section describes each task from an R perspective.

3.1 Creating New Variables

Use the assignment operator <- to create new variables. A wide array of operators and functions are available for this.

Three examples for doing the same computations

paracou$XYsum1 <- paracou$X + paracou$Y
paracou$XYmean1 <- (paracou$X + paracou$Y)/2
head(paracou)
attach(paracou)
paracou$XYsum2 <- X + Y
paracou$XYmean2 <- (X + Y)/2
detach(paracou)
head(paracou)
paracou <- transform( paracou,
sum3 = X + Y,
mean3 = (X + Y)/2 
)
head(paracou)

3.1.1 Recoding variables

In order to recode data, you will probably use one or more of R’s control structures.

1st example: create 2 size categories (big and small) according to Circonference >35

paracou$size <- ifelse(paracou$Circonference > 35, 
c("big"), c("small")) 
head(paracou)

Another example: create 3 size categories with thresholds: 34 and 37

attach(paracou)
paracou$size[Circonference < 34] <- "small"
paracou$size[Circonference >= 34 & Circonference <= 37] <- "medium"
paracou$size[Circonference > 37] <- "big"
detach(paracou)
head(paracou)

3.1.2 Renaming variables

You can rename variables programmatically or interactively.

-> rename interactively (Windows users only; fix() needs an X11 display on Mac or Linux)

#fix(paracou) # results are saved on close 

-> rename programmatically

library(reshape)
paracou <- rename(paracou, c(size="newname"))
head(paracou)

-> re-enter all the variable names in order changing the ones you need to change. The limitation is that you need to enter all of them!

mymatrix
##    C1 C2
## R1  1 26
## R2 24 68
colnames(mymatrix) <- c("x1","x2")
mymatrix
##    x1 x2
## R1  1 26
## R2 24 68
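
For a data frame, the same idea works through names(), renaming only the column you need (using mydata from the keyboard-input section):

```r
names(mydata)[names(mydata) == "gender"] <- "sex"
names(mydata)
## [1] "age"    "sex"    "weight"
```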

3.2 Operators

R’s binary and logical operators will look very familiar to programmers. Note that binary operators work on vectors and matrices as well as scalars.

3.2.1 Arithmetic Operators

Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
x %% y modulus (x mod y) 5%%2 is 1
x %/% y integer division 5%/%2 is 2
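
A few of these at the prompt:

```r
5 %% 2   # modulus
## [1] 1
5 %/% 2  # integer division
## [1] 2
2^10     # exponentiation
## [1] 1024
```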

3.2.2 Logical Operators

Operator Description
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x Not x
x | y x OR y
x & y x AND y
isTRUE(x) test if X is TRUE

3.2.3 Examples

x <- c(1:10)
x[(x>8) | (x<5)]
## [1]  1  2  3  4  9 10
x <- c(1:10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x > 8
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
x < 5
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
x > 8 | x < 5
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
x[c(T,T,T,T,F,F,F,F,T,T)]
## [1]  1  2  3  4  9 10

3.3 Built-in Functions

Almost everything in R is done through functions. Here I'm only referring to numeric and character functions that are commonly used in creating or recoding variables.

3.3.1 Numeric Functions

Function Description
abs(x) absolute value
sqrt(x) square root
ceiling(x) ceiling(3.475) is 4
floor(x) floor(3.475) is 3
trunc(x) trunc(5.99) is 5
round(x, digits=n) round(3.475, digits=2) is 3.48
cos(x), sin(x), tan(x) also acos(x), cosh(x), acosh(x), etc.
log(x) natural logarithm
log10(x) common logarithm
exp(x) e^x
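
For example:

```r
sqrt(abs(-16))
## [1] 4
floor(3.475); ceiling(3.475)
## [1] 3
## [1] 4
log(exp(2)) # natural log inverts exp
## [1] 2
```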

3.3.2 Character Functions

Function Description
substr(x, start=n1, stop=n2) Extract or replace substrings in a vector
grep(pattern, x ) Search for pattern in x. Returns matching indices.
sub(pattern, replacement, x) Find pattern in x and replace.
strsplit(x, split) Split the elements of vector x at split.
paste(…, sep="") Concatenate strings using sep to separate them.
toupper(x) Uppercase
tolower(x) Lowercase
x <- "abcdef" 
substr(x, 2, 4)
## [1] "bcd"
grep("A", c("b","A","c"), fixed=TRUE)
## [1] 2
sub("\\s",".","Hello There")
## [1] "Hello.There"
strsplit("abc", "")
## [[1]]
## [1] "a" "b" "c"
paste("x",1:3,sep="")
## [1] "x1" "x2" "x3"
paste("x",1:3,sep="M")
## [1] "xM1" "xM2" "xM3"
paste("Today is", date())
## [1] "Today is Mon Jan 15 10:40:20 2018"

3.3.3 Statistical Probability Functions

The following table describes functions related to probability distributions. For the random number generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-random numbers.

3.3.3.1 Normal Distribution

Function Description
dnorm(x) normal density function (by default mean=0, sd=1)
pnorm(q) cumulative normal probability for q
qnorm(p) normal quantile
rnorm(n, mean=0, sd=1) n random normal numbers with the given mean and sd
x <- pretty(c(-3,3), 30)
x
##  [1] -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4
## [15] -0.2  0.0  0.2  0.4  0.6  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4
## [29]  2.6  2.8  3.0
y <- dnorm(x)
y
##  [1] 0.00443 0.00792 0.01358 0.02239 0.03547 0.05399 0.07895 0.11092
##  [9] 0.14973 0.19419 0.24197 0.28969 0.33322 0.36827 0.39104 0.39894
## [17] 0.39104 0.36827 0.33322 0.28969 0.24197 0.19419 0.14973 0.11092
## [25] 0.07895 0.05399 0.03547 0.02239 0.01358 0.00792 0.00443
plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")

pnorm(1.96)
## [1] 0.975
qnorm(.9)
## [1] 1.28
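
To draw reproducible random normal numbers, set the seed first:

```r
set.seed(1234) # any integer gives a reproducible sequence
rnorm(5, mean=100, sd=15)
```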

3.3.3.2 Binomial Distribution

dbinom(x, size, prob), pbinom(q, size, prob), qbinom(p, size, prob), rbinom(n, size, prob)

dbinom(-5:20, 10, 0.5) 
##  [1] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000977 0.009766
##  [8] 0.043945 0.117188 0.205078 0.246094 0.205078 0.117188 0.043945
## [15] 0.009766 0.000977 0.000000 0.000000 0.000000 0.000000 0.000000
## [22] 0.000000 0.000000 0.000000 0.000000 0.000000
pbinom(-5:20, 10, .5)
##  [1] 0.000000 0.000000 0.000000 0.000000 0.000000 0.000977 0.010742
##  [8] 0.054687 0.171875 0.376953 0.623047 0.828125 0.945312 0.989258
## [15] 0.999023 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
## [22] 1.000000 1.000000 1.000000 1.000000 1.000000
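
The quantile and random-draw functions complete the set:

```r
qbinom(0.5, 10, 0.5) # median number of successes in 10 fair trials
## [1] 5
rbinom(3, 10, 0.5)   # three random draws
```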

3.3.3.3 Poisson Distribution

The Poisson distribution has mean = variance = lambda.

dpois(x, lambda), ppois(q, lambda), qpois(p, lambda), rpois(n, lambda)

# probability of 0, 1, or 2 events with lambda=4
dpois(0:2, 4)
## [1] 0.0183 0.0733 0.1465
# probability of at least 3 events with lambda=4 
1- ppois(2,4)
## [1] 0.762

3.3.3.4 Uniform Distribution

The uniform distribution follows the same pattern as the normal distribution above.

dunif(x, min=0, max=1), punif(q, min=0, max=1), qunif(p, min=0, max=1), runif(n, min=0, max=1)

#10 uniform random variates
x <- runif(10)

3.3.4 Other Statistical Functions

Other useful statistical functions are provided in the following table. Each has the option na.rm to strip missing values before calculations. Otherwise the presence of missing values will lead to a missing result. Object can be a numeric vector or data frame.

Function Description
mean(x, trim=0,na.rm=FALSE) mean of object x
sd(x) standard deviation of object(x)
var(x) variance
median(x) median
quantile(x, probs) quantiles of x, where probs is a vector of probabilities
range(x) range
sum(x) sum
diff(x, lag=1) lagged differences
min(x) minimum
max(x) maximum
scale(x, center=TRUE, scale=TRUE) column center or standardize a matrix
# trimmed mean, removing any missing values and 
# 5 percent of highest and lowest scores 
mx <- mean(age,trim=.05,na.rm=TRUE)
mx
## [1] 37
# 30th and 84th percentiles of x
y <- quantile(x, c(.3,.84))
y
##   30%   84% 
## 0.594 0.892

3.3.5 Other Useful Functions

Function Description
seq(from , to, by) generate a sequence
rep(x, ntimes) repeat x n times
cut(x, n) divide a continuous variable into a factor with n levels
indices <- seq(1,10,2)
indices
## [1] 1 3 5 7 9
y <- rep(1:3, 2)
y
## [1] 1 2 3 1 2 3
x
##  [1] 0.73282 0.87162 0.89639 0.70588 0.88684 0.12105 0.98174 0.00742
##  [9] 0.33138 0.73420
z <- cut(x, 5)
z
##  [1] (0.592,0.787]   (0.787,0.983]   (0.787,0.983]   (0.592,0.787]  
##  [5] (0.787,0.983]   (0.00645,0.202] (0.787,0.983]   (0.00645,0.202]
##  [9] (0.202,0.397]   (0.592,0.787]  
## 5 Levels: (0.00645,0.202] (0.202,0.397] (0.397,0.592] ... (0.787,0.983]

3.4 Control Structures

R has the standard control structures you would expect. expr can be multiple (compound) statements by enclosing them in braces { }. It is more efficient to use built-in functions rather than control structures whenever possible.

if-else:
if (cond) expr
if (cond) expr1 else expr2

for:
for (var in seq) expr

while:
while (cond) expr

switch:
switch(expr, …)

ifelse:
ifelse(test, yes, no)
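
A quick illustration of for and ifelse:

```r
total <- 0
for (i in 1:5) {        # loop over 1..5
  total <- total + i^2
}
total
## [1] 55
ifelse(total > 50, "big", "small") # vectorized if-else
## [1] "big"
```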

An example: transpose a matrix, i.e., a poor alternative to the built-in t() function

mytrans <- function(x) { 
  if (!is.matrix(x)) {
    warning("argument is not a matrix: returning NA")
    return(NA_real_)
  }
  y <- matrix(1, nrow=ncol(x), ncol=nrow(x)) 
  for (i in 1:nrow(x)) {
    for (j in 1:ncol(x)) {
      y[j,i] <- x[i,j] 
    }
  }
  return(y)
}
mymatrix
##    x1 x2
## R1  1 26
## R2 24 68
mytrans(mymatrix)
##      [,1] [,2]
## [1,]    1   24
## [2,]   26   68

3.5 User-written Functions

One of the great strengths of R is the user’s ability to add functions. In fact, many of the functions in R are actually functions of functions. The structure of a function is given below.

myfunction <- function(arg1, arg2, ...) {
  statements
  return(object)
}

Objects in the function are local to the function. The object returned can be any data type. Here is an example.

function example - get measures of central tendency and spread for a numeric vector x. The user has a choice of measures and whether the results are printed

mysummary <- function(x,npar=TRUE,print=TRUE) {
  if (!npar) {
    center <- mean(x); spread <- sd(x) 
  } else {
    center <- median(x); spread <- mad(x) 
  }
  if (print & !npar) {
    cat("Mean=", center, "\n", "SD=", spread, "\n")
  } else if (print & npar) {
    cat("Median=", center, "\n", "MAD=", spread, "\n")
  }
  result <- list(center=center,spread=spread)
  return(result)
}
mysummary
## function(x,npar=TRUE,print=TRUE) {
##   if (!npar) {
##     center <- mean(x); spread <- sd(x) 
##   } else {
##     center <- median(x); spread <- mad(x) 
##   }
##   if (print & !npar) {
##     cat("Mean=", center, "\n", "SD=", spread, "\n")
##   } else if (print & npar) {
##     cat("Median=", center, "\n", "MAD=", spread, "\n")
##   }
##   result <- list(center=center,spread=spread)
##   return(result)
## }
# invoking the function 
set.seed(1234)
x <- rpois(500, 4) 
y <- mysummary(x)
## Median= 4 
##  MAD= 1.48
# y$center is the median (4) 
# y$spread is the median absolute deviation (1.48)

It can be instructive to look at the code of a function. In R, you can view a function's code by typing the function name without the ( ). If this method fails, see the R Wiki for hints on viewing function source code.

Finally, you may want to store your own functions, and have them available in every session. You can customize the R environment to load your functions at start-up.

3.6 Sorting Data

3.6.1 Sorting a vector

Use sort() to sort a vector or factor into ascending or descending order.

a <- c(1, 2, 5.3, 6, -2, 4)
a
## [1]  1.0  2.0  5.3  6.0 -2.0  4.0
sort(a, decreasing = FALSE)
## [1] -2.0  1.0  2.0  4.0  5.3  6.0
sort(a, decreasing = TRUE)
## [1]  6.0  5.3  4.0  2.0  1.0 -2.0

3.6.2 Sorting a dataframe

To sort a data frame in R, use the order() function. By default, sorting is ASCENDING. Prepend the sorting variable with a minus sign to indicate DESCENDING order. Below are some examples.

# sorting examples using the mtcars dataset
attach(mtcars)
# sort by cyl
newdata <- mtcars[order(cyl),] 
newdata
# sort by cyl and mpg
newdata <- mtcars[order(cyl, mpg),]
newdata
#sort by cyl (ascending) and mpg (descending)
newdata <- mtcars[order(cyl, -mpg),] 
newdata
detach(mtcars)
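The same sorts can be written without attach()/detach() by qualifying the column names, which avoids masking problems:

```r
# sort by cyl (ascending) and mpg (descending) without attaching mtcars
newdata <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(newdata)
```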

3.7 Merging Data

3.7.1 Adding Columns

To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).

Merge two data frames by ID

data1<-paracou[1:10, 1:4]
data1
data2<-paracou[1:10, 4:8]
data2
total <- merge(data1,data2,by="NumArbre")
total

3.7.2 Adding Rows

To join two data frames (datasets) vertically, use the rbind() function. The two data frames must have the same variables, but they do not have to be in the same order.

data1<-paracou[1:5, 1:5]
data1
data2<-paracou[6:10, 1:5]
data2
total <- rbind(data1,data2)
total

If data frame A has variables that data frame B does not, then either:

Delete the extra variables in data frame A, or

Create the additional variables in data frame B and set them to NA (missing) before joining them with rbind( ).
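A minimal sketch of the second option, using small made-up data frames:

```r
dfA <- data.frame(id = 1:3, x = c(10, 20, 30), extra = c("a", "b", "c"))
dfB <- data.frame(id = 4:5, x = c(40, 50))
dfB$extra <- NA              # create the missing variable and fill with NA
combined <- rbind(dfA, dfB)  # now the column sets match
combined
```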

3.8 Aggregating Data

It is relatively easy to collapse data in R using one or more BY variables and a defined function.

Aggregate data frame mtcars by cyl and vs, returning means for numeric variables

attach(mtcars)
aggdata <-aggregate(mtcars, by=list(cyl,vs), 
  FUN=mean, na.rm=TRUE)
print(aggdata)
##   Group.1 Group.2  mpg cyl disp    hp drat   wt qsec vs    am gear carb
## 1       4       0 26.0   4  120  91.0 4.43 2.14 16.7  0 1.000 5.00 2.00
## 2       6       0 20.6   6  155 131.7 3.81 2.75 16.3  0 1.000 4.33 4.67
## 3       8       0 15.1   8  353 209.2 3.23 4.00 16.8  0 0.143 3.29 3.50
## 4       4       1 26.7   4  104  81.8 4.04 2.30 19.4  1 0.700 4.00 1.50
## 5       6       1 19.1   6  205 115.2 3.42 3.39 19.2  1 0.000 3.50 2.50
detach(mtcars)

When using the aggregate() function, the by variables must be in a list (even if there is only one). The function can be built-in or user provided.
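Naming the list elements replaces the default Group.1, Group.2 labels; the formula interface is an alternative when a single response variable is aggregated:

```r
# named by-variables give labelled grouping columns
aggdata <- aggregate(mtcars, by = list(Cylinders = mtcars$cyl, VS = mtcars$vs),
                     FUN = mean, na.rm = TRUE)
head(aggdata)
# formula interface: mean mpg by cyl and vs
aggregate(mpg ~ cyl + vs, data = mtcars, FUN = mean)
```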

See also summarize() in the Hmisc package and summaryBy() in the doBy package.

3.9 Reshaping Data

R provides a variety of methods for reshaping data prior to analysis.

3.9.1 Transpose

Use the t() function to transpose a matrix or a data frame. In the latter case, rownames become variable (column) names.

mtcars
t(mtcars)
##      Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
## mpg      21.00         21.00      22.80          21.40             18.70
## cyl       6.00          6.00       4.00           6.00              8.00
## disp    160.00        160.00     108.00         258.00            360.00
## hp      110.00        110.00      93.00         110.00            175.00
## drat      3.90          3.90       3.85           3.08              3.15
## wt        2.62          2.88       2.32           3.21              3.44
## qsec     16.46         17.02      18.61          19.44             17.02
## vs        0.00          0.00       1.00           1.00              0.00
## am        1.00          1.00       1.00           0.00              0.00
## gear      4.00          4.00       4.00           3.00              3.00
## carb      4.00          4.00       1.00           1.00              2.00
##      Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
## mpg    18.10      14.30     24.40    22.80    19.20     17.80      16.40
## cyl     6.00       8.00      4.00     4.00     6.00      6.00       8.00
## disp  225.00     360.00    146.70   140.80   167.60    167.60     275.80
## hp    105.00     245.00     62.00    95.00   123.00    123.00     180.00
## drat    2.76       3.21      3.69     3.92     3.92      3.92       3.07
## wt      3.46       3.57      3.19     3.15     3.44      3.44       4.07
## qsec   20.22      15.84     20.00    22.90    18.30     18.90      17.40
## vs      1.00       0.00      1.00     1.00     1.00      1.00       0.00
## am      0.00       0.00      0.00     0.00     0.00      0.00       0.00
## gear    3.00       3.00      4.00     4.00     4.00      4.00       3.00
## carb    1.00       4.00      2.00     2.00     4.00      4.00       3.00
##      Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## mpg       17.30       15.20              10.40               10.40
## cyl        8.00        8.00               8.00                8.00
## disp     275.80      275.80             472.00              460.00
## hp       180.00      180.00             205.00              215.00
## drat       3.07        3.07               2.93                3.00
## wt         3.73        3.78               5.25                5.42
## qsec      17.60       18.00              17.98               17.82
## vs         0.00        0.00               0.00                0.00
## am         0.00        0.00               0.00                0.00
## gear       3.00        3.00               3.00                3.00
## carb       3.00        3.00               4.00                4.00
##      Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
## mpg              14.70    32.40       30.40          33.90         21.50
## cyl               8.00     4.00        4.00           4.00          4.00
## disp            440.00    78.70       75.70          71.10        120.10
## hp              230.00    66.00       52.00          65.00         97.00
## drat              3.23     4.08        4.93           4.22          3.70
## wt                5.34     2.20        1.61           1.83          2.46
## qsec             17.42    19.47       18.52          19.90         20.01
## vs                0.00     1.00        1.00           1.00          1.00
## am                0.00     1.00        1.00           1.00          0.00
## gear              3.00     4.00        4.00           4.00          3.00
## carb              4.00     1.00        2.00           1.00          1.00
##      Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
## mpg             15.50       15.20      13.30            19.20     27.30
## cyl              8.00        8.00       8.00             8.00      4.00
## disp           318.00      304.00     350.00           400.00     79.00
## hp             150.00      150.00     245.00           175.00     66.00
## drat             2.76        3.15       3.73             3.08      4.08
## wt               3.52        3.44       3.84             3.85      1.94
## qsec            16.87       17.30      15.41            17.05     18.90
## vs               0.00        0.00       0.00             0.00      1.00
## am               0.00        0.00       0.00             0.00      1.00
## gear             3.00        3.00       3.00             3.00      4.00
## carb             2.00        2.00       4.00             2.00      1.00
##      Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora
## mpg          26.00        30.40          15.80        19.70         15.00
## cyl           4.00         4.00           8.00         6.00          8.00
## disp        120.30        95.10         351.00       145.00        301.00
## hp           91.00       113.00         264.00       175.00        335.00
## drat          4.43         3.77           4.22         3.62          3.54
## wt            2.14         1.51           3.17         2.77          3.57
## qsec         16.70        16.90          14.50        15.50         14.60
## vs            0.00         1.00           0.00         0.00          0.00
## am            1.00         1.00           1.00         1.00          1.00
## gear          5.00         5.00           5.00         5.00          5.00
## carb          2.00         2.00           4.00         6.00          8.00
##      Volvo 142E
## mpg       21.40
## cyl        4.00
## disp     121.00
## hp       109.00
## drat       4.11
## wt         2.78
## qsec      18.60
## vs         1.00
## am         1.00
## gear       4.00
## carb       2.00

3.9.2 The Reshape Package

Hadley Wickham has created a comprehensive package called reshape to massage data. Basically, you “melt” data so that each row is a unique id-variable combination. Then you “cast” the melted data into any shape you would like. Here is a very simple example.

library(reshape)
mydata
mdata <- melt(mydata, id=c("gender"))
mdata
subjmeans <- cast(mdata, gender~variable, mean)
subjmeans

There is much more that you can do with the melt( ) and cast( ) functions. See the package documentation ??reshape for more details.
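Because mydata is not defined above, here is a self-contained sketch with a small made-up data frame (requires the reshape package):

```r
library(reshape)
mydata <- data.frame(gender = c("M", "M", "F", "F"),
                     age    = c(20, 22, 21, 23),
                     score  = c(5, 7, 6, 8))
mdata <- melt(mydata, id = c("gender"))  # one row per gender-variable pair
cast(mdata, gender ~ variable, mean)     # mean age and score by gender
```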

3.10 Subsetting Data

R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations. The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.

3.10.1 Selecting (Keeping) Variables

select variables NomVern, Circonference

myvars <- c("NomVern", "Circonference")
newparacou <- paracou[myvars]
head(newparacou)

select 1st and 5th through 10th variables

newparacou <- paracou[,c(1,5:10)]
head(newparacou)

3.10.2 Excluding (Dropping) Variables

exclude selected variables by name

myvars <- names(paracou) %in% c("XYsum", "XYmean", "XYsum2", "XYmean2", "sum3", "mean3","newname") 
paracou <- paracou[!myvars]
head(paracou)

exclude the 3rd and 5th variables

newdata <- paracou[c(-3,-5)]
head(newdata)

Selecting Observations

newdata <- paracou[1:5,]
newdata

based on variable values

paracou[ which(paracou$Circonference > 37), ]

Selection using the Subset Function

The subset( ) function is the easiest way to select variables and observations. In the following example, we select all trees with a circumference of at least 35 and tree numbers over 2970, keeping NomVern, Circonference, and NumArbre.

using subset function

newdata <- subset(paracou, Circonference >= 35 & NumArbre > 2970,
                  select=c(NomVern, Circonference, NumArbre))
newdata

3.10.3 Random Samples

Use the sample( ) function to take a random sample of size n from a dataset.

cells <- c(1, 26, 24, 68)
cells
## [1]  1 26 24 68
sample(cells, 2, replace = FALSE)
## [1] 24 68
sample(cells, 2, replace = FALSE)
## [1] 68  1

**TRY IT!! Take a random sample of size 10 from the paracou dataset**

mysample <- paracou[sample(1:nrow(paracou), 10, replace=FALSE),]
mysample

3.11 Data Type Conversion

Type conversions in R work as you would expect. For example, combining a character string with a numeric vector (e.g., with c()) converts all the elements in the vector to character.

Use class() to test for data class.

Use the following for specific tests

is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()

Use the following to convert

as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()

                 to one long vector     to matrix                  to data frame
from vector      c(x,y)                 cbind(x,y) or rbind(x,y)   data.frame(x,y)
from matrix      as.vector(mymatrix)                               as.data.frame(mymatrix)
from data frame                         as.matrix(myframe)
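A few of these conversions in action:

```r
x <- 1:3; y <- 4:6
c(x, y)                  # one long vector
m <- cbind(x, y)         # 3 x 2 matrix
as.vector(m)             # matrix back to vector (column-major order)
df <- as.data.frame(m)   # matrix to data frame
as.matrix(df)            # data frame to matrix
as.numeric("3.14")       # character to numeric
c(1, "a")                # mixing types coerces everything to character
```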

4 Basic Statistics

This section describes basic (and not so basic) statistics. It includes code for obtaining descriptive statistics, frequency counts and crosstabulations (including tests of independence), correlations (Pearson, Spearman, Kendall, polychoric), t-tests (with equal and unequal variances), nonparametric tests of group differences (Mann-Whitney U, Wilcoxon Signed-Rank, Kruskal-Wallis Test, Friedman Test), multiple linear regression (including diagnostics, cross-validation and variable selection), analysis of variance (including ANCOVA and MANOVA), and statistics based on resampling.

Since modern data analyses almost always involve graphical assessments of relationships and assumptions, links to appropriate graphical methods are provided throughout.

It is always important to check model assumptions before making statistical inferences. Although it is somewhat artificial to separate regression modeling and an ANOVA framework in this regard, many people learn these topics separately, so I’ve followed the same convention here.

Regression diagnostics cover outliers, influential observations, non-normality, non-constant error variance, multicollinearity, nonlinearity, and non-independence of errors. Classical test assumptions for ANOVA/ANCOVA/MANCOVA include the assessment of normality and homogeneity of variances in the univariate case, and multivariate normality and homogeneity of covariance matrices in the multivariate case. The identification of multivariate outliers is also considered.

Power analysis provides methods of statistical power analysis and sample size estimation for a variety of designs.

Finally, two functions that aid in efficient processing (with and by) are described.

More advanced statistical modeling can be found in the Advanced Statistics section.

4.1 Descriptive Statistics

R provides a wide range of functions for obtaining summary statistics. One method of obtaining descriptive statistics is to use the sapply( ) function with a specified summary statistic.

Get means for the variables in data frame mtcars, excluding missing values:
sapply(mtcars, mean, na.rm=TRUE)
##     mpg     cyl    disp      hp    drat      wt    qsec      vs      am 
##  20.091   6.188 230.722 146.688   3.597   3.217  17.849   0.438   0.406 
##    gear    carb 
##   3.688   2.812

Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile. There are also numerous R functions designed to provide a range of descriptive statistics at once. For example

mean, median, 25th and 75th percentiles, min, max

summary(mtcars)
##       mpg            cyl            disp           hp           drat     
##  Min.   :10.4   Min.   :4.00   Min.   : 71   Min.   : 52   Min.   :2.76  
##  1st Qu.:15.4   1st Qu.:4.00   1st Qu.:121   1st Qu.: 96   1st Qu.:3.08  
##  Median :19.2   Median :6.00   Median :196   Median :123   Median :3.69  
##  Mean   :20.1   Mean   :6.19   Mean   :231   Mean   :147   Mean   :3.60  
##  3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326   3rd Qu.:180   3rd Qu.:3.92  
##  Max.   :33.9   Max.   :8.00   Max.   :472   Max.   :335   Max.   :4.93  
##        wt            qsec            vs              am       
##  Min.   :1.51   Min.   :14.5   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000   1st Qu.:0.000  
##  Median :3.33   Median :17.7   Median :0.000   Median :0.000  
##  Mean   :3.22   Mean   :17.9   Mean   :0.438   Mean   :0.406  
##  3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :5.42   Max.   :22.9   Max.   :1.000   Max.   :1.000  
##       gear           carb     
##  Min.   :3.00   Min.   :1.00  
##  1st Qu.:3.00   1st Qu.:2.00  
##  Median :4.00   Median :2.00  
##  Mean   :3.69   Mean   :2.81  
##  3rd Qu.:4.00   3rd Qu.:4.00  
##  Max.   :5.00   Max.   :8.00

Tukey five-number summary: min, lower hinge, median, upper hinge, max

fivenum(x)
## [1]  0  3  4  5 11

Using the Hmisc package

library(Hmisc)
describe(x) 
## x 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      500        0       12    0.973    4.052     2.21        1        2 
##      .25      .50      .75      .90      .95 
##        3        4        5        7        8 
##                                                                       
## Value          0     1     2     3     4     5     6     7     8     9
## Frequency      8    29    71   110    99    87    36    28    15    11
## Proportion 0.016 0.058 0.142 0.220 0.198 0.174 0.072 0.056 0.030 0.022
##                       
## Value         10    11
## Frequency      3     3
## Proportion 0.006 0.006

Using the pastecs package

library(pastecs)
stat.desc(mtcars) 

Using the psych package (note that psych and Hmisc both define a describe() function, so it is safest to use the psych:: prefix)

library(psych)
psych::describe(mtcars)

Summary Statistics by Group

A simple way of generating summary statistics by a grouping variable is available in the psych package.

describe.by(mtcars, mtcars$cyl)
## Warning: describe.by is deprecated. Please use the describeBy function
## 
##  Descriptive statistics by group 
## group: 4
##      vars  n   mean    sd median trimmed   mad   min    max range  skew
## mpg     1 11  26.66  4.51  26.00   26.44  6.52 21.40  33.90 12.50  0.26
## cyl     2 11   4.00  0.00   4.00    4.00  0.00  4.00   4.00  0.00   NaN
## disp    3 11 105.14 26.87 108.00  104.30 43.00 71.10 146.70 75.60  0.12
## hp      4 11  82.64 20.93  91.00   82.67 32.62 52.00 113.00 61.00  0.01
## drat    5 11   4.07  0.37   4.08    4.02  0.34  3.69   4.93  1.24  1.00
## wt      6 11   2.29  0.57   2.20    2.27  0.54  1.51   3.19  1.68  0.30
## qsec    7 11  19.14  1.68  18.90   18.99  1.48 16.70  22.90  6.20  0.55
## vs      8 11   0.91  0.30   1.00    1.00  0.00  0.00   1.00  1.00 -2.47
## am      9 11   0.73  0.47   1.00    0.78  0.00  0.00   1.00  1.00 -0.88
## gear   10 11   4.09  0.54   4.00    4.11  0.00  3.00   5.00  2.00  0.11
## carb   11 11   1.55  0.52   2.00    1.56  0.00  1.00   2.00  1.00 -0.16
##      kurtosis   se
## mpg     -1.65 1.36
## cyl       NaN 0.00
## disp    -1.64 8.10
## hp      -1.71 6.31
## drat     0.12 0.11
## wt      -1.36 0.17
## qsec    -0.02 0.51
## vs       4.52 0.09
## am      -1.31 0.14
## gear    -0.01 0.16
## carb    -2.15 0.16
## -------------------------------------------------------- 
## group: 6
##      vars n   mean    sd median trimmed   mad    min    max  range  skew
## mpg     1 7  19.74  1.45  19.70   19.74  1.93  17.80  21.40   3.60 -0.16
## cyl     2 7   6.00  0.00   6.00    6.00  0.00   6.00   6.00   0.00   NaN
## disp    3 7 183.31 41.56 167.60  183.31 11.27 145.00 258.00 113.00  0.80
## hp      4 7 122.29 24.26 110.00  122.29  7.41 105.00 175.00  70.00  1.36
## drat    5 7   3.59  0.48   3.90    3.59  0.03   2.76   3.92   1.16 -0.74
## wt      6 7   3.12  0.36   3.21    3.12  0.36   2.62   3.46   0.84 -0.22
## qsec    7 7  17.98  1.71  18.30   17.98  1.90  15.50  20.22   4.72 -0.12
## vs      8 7   0.57  0.53   1.00    0.57  0.00   0.00   1.00   1.00 -0.23
## am      9 7   0.43  0.53   0.00    0.43  0.00   0.00   1.00   1.00  0.23
## gear   10 7   3.86  0.69   4.00    3.86  0.00   3.00   5.00   2.00  0.11
## carb   11 7   3.43  1.81   4.00    3.43  0.00   1.00   6.00   5.00 -0.26
##      kurtosis    se
## mpg     -1.91  0.55
## cyl       NaN  0.00
## disp    -1.23 15.71
## hp       0.25  9.17
## drat    -1.40  0.18
## wt      -1.98  0.13
## qsec    -1.75  0.65
## vs      -2.20  0.20
## am      -2.20  0.20
## gear    -1.24  0.26
## carb    -1.50  0.69
## -------------------------------------------------------- 
## group: 8
##      vars  n   mean    sd median trimmed   mad    min    max  range  skew
## mpg     1 14  15.10  2.56  15.20   15.15  1.56  10.40  19.20   8.80 -0.36
## cyl     2 14   8.00  0.00   8.00    8.00  0.00   8.00   8.00   0.00   NaN
## disp    3 14 353.10 67.77 350.50  349.63 73.39 275.80 472.00 196.20  0.45
## hp      4 14 209.21 50.98 192.50  203.67 44.48 150.00 335.00 185.00  0.91
## drat    5 14   3.23  0.37   3.12    3.19  0.16   2.76   4.22   1.46  1.34
## wt      6 14   4.00  0.76   3.75    3.95  0.41   3.17   5.42   2.25  0.99
## qsec    7 14  16.77  1.20  17.18   16.86  0.79  14.50  18.00   3.50 -0.80
## vs      8 14   0.00  0.00   0.00    0.00  0.00   0.00   0.00   0.00   NaN
## am      9 14   0.14  0.36   0.00    0.08  0.00   0.00   1.00   1.00  1.83
## gear   10 14   3.29  0.73   3.00    3.17  0.00   3.00   5.00   2.00  1.83
## carb   11 14   3.50  1.56   3.50    3.25  0.74   2.00   8.00   6.00  1.48
##      kurtosis    se
## mpg     -0.57  0.68
## cyl       NaN  0.00
## disp    -1.26 18.11
## hp       0.09 13.62
## drat     1.08  0.10
## wt      -0.71  0.20
## qsec    -0.92  0.32
## vs        NaN  0.00
## am       1.45  0.10
## gear     1.45  0.19
## carb     2.24  0.42

The doBy package provides similar functionality: you define the desired table using a model formula and a function.

4.2 Frequencies and Crosstabs

This section describes the creation of frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for graphically displaying results.

4.2.1 Generating Frequency Tables

R provides many methods for creating frequency and contingency tables. Three are described below. In the following examples, assume that A, B, and C represent categorical variables.

Table

You can generate frequency tables using the table( ) function, tables of proportions using the prop.table( ) function, and marginal frequencies using margin.table( ).

attach(mtcars)
## The following object is masked from package:ggplot2:
## 
##     mpg
mytable <- table(cyl,carb)
mytable
##    carb
## cyl 1 2 3 4 6 8
##   4 5 6 0 0 0 0
##   6 2 0 0 4 1 0
##   8 0 4 3 6 0 1
margin.table(mytable, 1)
## cyl
##  4  6  8 
## 11  7 14
margin.table(mytable, 2)
## carb
##  1  2  3  4  6  8 
##  7 10  3 10  1  1
prop.table(mytable) # cell percentages
##    carb
## cyl      1      2      3      4      6      8
##   4 0.1562 0.1875 0.0000 0.0000 0.0000 0.0000
##   6 0.0625 0.0000 0.0000 0.1250 0.0312 0.0000
##   8 0.0000 0.1250 0.0938 0.1875 0.0000 0.0312
prop.table(mytable, 1) # row percentages 
##    carb
## cyl      1      2      3      4      6      8
##   4 0.4545 0.5455 0.0000 0.0000 0.0000 0.0000
##   6 0.2857 0.0000 0.0000 0.5714 0.1429 0.0000
##   8 0.0000 0.2857 0.2143 0.4286 0.0000 0.0714
prop.table(mytable, 2) # column percentages
##    carb
## cyl     1     2     3     4     6     8
##   4 0.714 0.600 0.000 0.000 0.000 0.000
##   6 0.286 0.000 0.000 0.400 1.000 0.000
##   8 0.000 0.400 1.000 0.600 0.000 1.000

table( ) can also generate multidimensional tables based on 3 or more categorical variables. In this case, use the ftable( ) function to print the results more attractively.

mytable <- table(cyl, carb, am) 
ftable(mytable)
##          am 0 1
## cyl carb       
## 4   1       1 4
##     2       2 4
##     3       0 0
##     4       0 0
##     6       0 0
##     8       0 0
## 6   1       2 0
##     2       0 0
##     3       0 0
##     4       2 2
##     6       0 1
##     8       0 0
## 8   1       0 0
##     2       4 0
##     3       3 0
##     4       5 1
##     6       0 0
##     8       0 1

table( ) ignores missing values. To include NA as a category in counts, include the table option exclude=NULL if the variable is a vector. If the variable is a factor you have to create a new factor using newfactor <- factor(oldfactor, exclude=NULL).
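For example, with a vector containing an NA:

```r
z <- c("a", "b", NA, "a")
table(z)                  # NA is silently dropped
table(z, exclude = NULL)  # NA appears as its own category
```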

xtabs

The xtabs( ) function allows you to create crosstabulations using formula style input.

mytable <- xtabs(~cyl+carb+am, data=mtcars)
ftable(mytable) # print table 
##          am 0 1
## cyl carb       
## 4   1       1 4
##     2       2 4
##     3       0 0
##     4       0 0
##     6       0 0
##     8       0 0
## 6   1       2 0
##     2       0 0
##     3       0 0
##     4       2 2
##     6       0 1
##     8       0 0
## 8   1       0 0
##     2       4 0
##     3       3 0
##     4       5 1
##     6       0 0
##     8       0 1
summary(mytable) # chi-square test of independence
## Call: xtabs(formula = ~cyl + carb + am, data = mtcars)
## Number of cases in table: 32 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 55, df = 27, p-value = 0.001
##  Chi-squared approximation may be incorrect

If a variable is included on the left side of the formula, it is assumed to be a vector of frequencies (useful if the data have already been tabulated).
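For example, counts that were tabulated earlier can be fed back in through the left-hand side:

```r
# pre-tabulated counts: one row per cyl/am combination with a Freq column
counts <- as.data.frame(table(cyl = mtcars$cyl, am = mtcars$am))
xtabs(Freq ~ cyl + am, data = counts)  # rebuilds the 3 x 2 table
```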

4.2.2 Tests of Independence

4.2.2.1 Chi-Square Test

For 2-way tables you can use chisq.test(mytable) to test independence of the row and column variable. By default, the p-value is calculated from the asymptotic chi-squared distribution of the test statistic. Optionally, the p-value can be derived via Monte Carlo simulation.

mytable <- table(cyl,carb)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 20, df = 10, p-value = 0.007

4.2.2.2 Fisher Exact Test

fisher.test(x) provides an exact test of independence, where x is a two-dimensional contingency table in matrix form.

mytable <- table(cyl,carb)
fisher.test(mytable)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mytable
## p-value = 3e-04
## alternative hypothesis: two.sided

4.2.2.3 Mantel-Haenszel test

Use the mantelhaen.test(x) function to perform a Cochran-Mantel-Haenszel chi-squared test of the null hypothesis that two nominal variables are conditionally independent in each stratum, assuming that there is no three-way interaction. x is a 3 dimensional contingency table, where the last dimension refers to the strata.
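A sketch using mtcars, with am as the stratifying variable (the choice of variables is illustrative):

```r
# test whether cyl and vs are conditionally independent within each am stratum
mytable <- xtabs(~cyl + vs + am, data = mtcars)  # strata must be the last dimension
mantelhaen.test(mytable)
```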

4.2.2.4 Loglinear Models

You can use the loglm( ) function in the MASS package to produce log-linear models. For example, let’s assume we have a 3-way contingency table based on variables cyl, carb, and am. We can perform the following tests:

library(MASS)
mytable <- xtabs(~cyl+carb+am, data=mtcars)
#Mutual Independence: cyl, carb, and am are mutually independent.
loglm(~cyl+carb+am, mytable)
## Call:
## loglm(formula = ~cyl + carb + am, data = mytable)
## 
## Statistics:
##                   X^2 df P(> X^2)
## Likelihood Ratio 52.3 27  0.00247
## Pearson          55.2 27  0.00107
#Partial Independence: cyl is partially independent of carb and am 
#(i.e., cyl is independent of the composite variable carbam)
loglm(~cyl+carb+am+carb*am, mytable)
## Call:
## loglm(formula = ~cyl + carb + am + carb * am, data = mytable)
## 
## Statistics:
##                   X^2 df P(> X^2)
## Likelihood Ratio 44.3 22  0.00329
## Pearson           NaN 22      NaN
#Conditional Independence: cyl is independent of carb, given am.
loglm(~cyl+carb+am+cyl*am+carb*am, mytable)
## Call:
## loglm(formula = ~cyl + carb + am + cyl * am + carb * am, data = mytable)
## 
## Statistics:
##                  X^2 df P(> X^2)
## Likelihood Ratio  35 20   0.0203
## Pearson          NaN 20      NaN
#No Three-Way Interaction
loglm(~cyl+carb+am+cyl*carb+cyl*am+carb*am, mytable)
## Call:
## loglm(formula = ~cyl + carb + am + cyl * carb + cyl * am + carb * 
##     am, data = mytable)
## 
## Statistics:
##                    X^2 df P(> X^2)
## Likelihood Ratio 0.327 10        1
## Pearson            NaN 10      NaN

4.2.3 Measures of Association

The assocstats(mytable) function in the vcd package calculates the phi coefficient, contingency coefficient, and Cramer’s V for an r x c table. The kappa(mytable) function in the vcd package calculates Cohen’s kappa and weighted kappa for a confusion matrix.
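A brief sketch (requires the vcd package):

```r
library(vcd)
mytable <- table(mtcars$cyl, mtcars$am)
assocstats(mytable)  # phi (2x2 tables only), contingency coefficient, Cramer's V
```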

4.3 Correlations

You can use the cor( ) function to produce correlations and the cov( ) function to produce covariances.

A simplified format is cor(x, use=, method= ) where

x is a matrix or data frame

use specifies the handling of missing data. Options are all.obs (assumes no missing data; missing data will produce an error), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion).

method specifies the type of correlation. Options are pearson, spearman, or kendall.

# Correlations/covariances among numeric variables in 
# data frame mtcars. Use listwise deletion of missing data. 
cor(mtcars, use="complete.obs", method="kendall") 
##         mpg    cyl   disp     hp    drat     wt    qsec     vs      am
## mpg   1.000 -0.795 -0.768 -0.743  0.4645 -0.728  0.3154  0.590  0.4690
## cyl  -0.795  1.000  0.814  0.785 -0.5513  0.728 -0.4490 -0.771 -0.4946
## disp -0.768  0.814  1.000  0.666 -0.4990  0.743 -0.3008 -0.603 -0.5203
## hp   -0.743  0.785  0.666  1.000 -0.3826  0.611 -0.4729 -0.631 -0.3040
## drat  0.465 -0.551 -0.499 -0.383  1.0000 -0.547  0.0327  0.375  0.5755
## wt   -0.728  0.728  0.743  0.611 -0.5471  1.000 -0.1420 -0.488 -0.6139
## qsec  0.315 -0.449 -0.301 -0.473  0.0327 -0.142  1.0000  0.658 -0.1689
## vs    0.590 -0.771 -0.603 -0.631  0.3751 -0.488  0.6575  1.000  0.1683
## am    0.469 -0.495 -0.520 -0.304  0.5755 -0.614 -0.1689  0.168  1.0000
## gear  0.433 -0.513 -0.476 -0.279  0.5839 -0.544 -0.0913  0.270  0.7708
## carb -0.504  0.465  0.414  0.596 -0.0954  0.371 -0.5064 -0.577 -0.0586
##         gear    carb
## mpg   0.4332 -0.5044
## cyl  -0.5125  0.4654
## disp -0.4760  0.4137
## hp   -0.2794  0.5960
## drat  0.5839 -0.0954
## wt   -0.5436  0.3714
## qsec -0.0913 -0.5064
## vs    0.2697 -0.5769
## am    0.7708 -0.0586
## gear  1.0000  0.0980
## carb  0.0980  1.0000
cov(mtcars, use="complete.obs")
##          mpg     cyl    disp      hp     drat      wt     qsec       vs
## mpg    36.32  -9.172  -633.1 -320.73   2.1951  -5.117   4.5091   2.0171
## cyl    -9.17   3.190   199.7  101.93  -0.6684   1.367  -1.8869  -0.7298
## disp -633.10 199.660 15360.8 6721.16 -47.0640 107.684 -96.0517 -44.3776
## hp   -320.73 101.931  6721.2 4700.87 -16.4511  44.193 -86.7701 -24.9879
## drat    2.20  -0.668   -47.1  -16.45   0.2859  -0.373   0.0871   0.1186
## wt     -5.12   1.367   107.7   44.19  -0.3727   0.957  -0.3055  -0.2737
## qsec    4.51  -1.887   -96.1  -86.77   0.0871  -0.305   3.1932   0.6706
## vs      2.02  -0.730   -44.4  -24.99   0.1186  -0.274   0.6706   0.2540
## am      1.80  -0.466   -36.6   -8.32   0.1902  -0.338  -0.2050   0.0423
## gear    2.14  -0.649   -50.8   -6.36   0.2760  -0.421  -0.2804   0.0766
## carb   -5.36   1.520    79.1   83.04  -0.0784   0.676  -1.8941  -0.4637
##            am     gear    carb
## mpg    1.8039   2.1357 -5.3631
## cyl   -0.4657  -0.6492  1.5202
## disp -36.5640 -50.8026 79.0687
## hp    -8.3206  -6.3589 83.0363
## drat   0.1902   0.2760 -0.0784
## wt    -0.3381  -0.4211  0.6758
## qsec  -0.2050  -0.2804 -1.8941
## vs     0.0423   0.0766 -0.4637
## am     0.2490   0.2923  0.0464
## gear   0.2923   0.5444  0.3266
## carb   0.0464   0.3266  2.6089

Unfortunately, neither cor( ) nor cov( ) produces tests of significance, although you can use the cor.test( ) function to test a single correlation coefficient.
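For example, to test one pair of variables:

```r
# Pearson correlation with a t-based test and confidence interval
cor.test(mtcars$mpg, mtcars$wt)
# rank-based alternative (warns about ties in this dataset)
cor.test(mtcars$mpg, mtcars$wt, method = "spearman")
```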

The rcorr( ) function in the Hmisc package produces correlations/covariances and significance levels for pearson and spearman correlations. However, input must be a matrix and pairwise deletion is used.

# Correlations with significance levels
rcorr(as.matrix(mtcars))
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
## 
## n= 32 
## 
## 
## P
##      mpg    cyl    disp   hp     drat   wt     qsec   vs     am     gear  
## mpg         0.0000 0.0000 0.0000 0.0000 0.0000 0.0171 0.0000 0.0003 0.0054
## cyl  0.0000        0.0000 0.0000 0.0000 0.0000 0.0004 0.0000 0.0022 0.0042
## disp 0.0000 0.0000        0.0000 0.0000 0.0000 0.0131 0.0000 0.0004 0.0010
## hp   0.0000 0.0000 0.0000        0.0100 0.0000 0.0000 0.0000 0.1798 0.4930
## drat 0.0000 0.0000 0.0000 0.0100        0.0000 0.6196 0.0117 0.0000 0.0000
## wt   0.0000 0.0000 0.0000 0.0000 0.0000        0.3389 0.0010 0.0000 0.0005
## qsec 0.0171 0.0004 0.0131 0.0000 0.6196 0.3389        0.0000 0.2057 0.2425
## vs   0.0000 0.0000 0.0000 0.0000 0.0117 0.0010 0.0000        0.3570 0.2579
## am   0.0003 0.0022 0.0004 0.1798 0.0000 0.0000 0.2057 0.3570        0.0000
## gear 0.0054 0.0042 0.0010 0.4930 0.0000 0.0005 0.2425 0.2579 0.0000       
## carb 0.0011 0.0019 0.0253 0.0000 0.6212 0.0146 0.0000 0.0007 0.7545 0.1290
##      carb  
## mpg  0.0011
## cyl  0.0019
## disp 0.0253
## hp   0.0000
## drat 0.6212
## wt   0.0146
## qsec 0.0000
## vs   0.0007
## am   0.7545
## gear 0.1290
## carb

You can use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and the columns of Y.

# Correlation matrix from mtcars
# with mpg, cyl, and disp as rows 
# and hp, drat, and wt as columns 
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)
##          hp   drat     wt
## mpg  -0.776  0.681 -0.868
## cyl   0.832 -0.700  0.782
## disp  0.791 -0.710  0.888

4.4 t-tests

The t.test( ) function produces a variety of t-tests. Unlike most statistical packages, the default assumes unequal variances and applies the Welch df modification.

4.4.1 Independent 2-group t-test

t.test(y~x), where y is numeric and x is a binary factor

# independent 2-group t-test
t.test(rnorm(10,0,1),rnorm(10,1,1)) # where y1 and y2 are numeric
## 
##  Welch Two Sample t-test
## 
## data:  rnorm(10, 0, 1) and rnorm(10, 1, 1)
## t = -4, df = 20, p-value = 6e-04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.605 -0.858
## sample estimates:
## mean of x mean of y 
##    -0.211     1.521
# paired t-test
t.test(rnorm(10,0,1),rnorm(10,1,1),paired=TRUE) # where y1 & y2 are numeric
## 
##  Paired t-test
## 
## data:  rnorm(10, 0, 1) and rnorm(10, 1, 1)
## t = -2, df = 9, p-value = 0.06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.0434  0.0693
## sample estimates:
## mean of the differences 
##                  -0.987

4.4.2 One sample t-test

t.test(y,mu=3) # Ho: mu=3
## 
##  One Sample t-test
## 
## data:  y
## t = 6, df = 100, p-value = 3e-08
## alternative hypothesis: true mean is not equal to 3
## 95 percent confidence interval:
##  35.3 67.0
## sample estimates:
## mean of x 
##      51.2

You can use the var.equal=TRUE option to specify equal variances and a pooled variance estimate. You can use the alternative="less" or alternative="greater" option to specify a one-tailed test.
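For example, with two simulated samples as in the examples above:

y1 <- rnorm(10,0,1); y2 <- rnorm(10,1,1)
t.test(y1, y2, var.equal=TRUE)      # pooled variance estimate
t.test(y1, y2, alternative="less")  # Ha: mean of y1 is less than mean of y2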

Nonparametric and resampling alternatives to t-tests are available.

4.5 Non-Parametric Tests of Group Differences

R provides functions for carrying out Mann-Whitney U, Wilcoxon Signed Rank, Kruskal-Wallis, and Friedman tests.

4.5.1 Independent 2-group Mann-Whitney U Test

wilcox.test(paracou$Circonference~rbinom(19,1,0.5)) 
## Warning in wilcox.test.default(x = c(36, 33.5, 32.5, 34.5, 33, 32.5, 36, :
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  paracou$Circonference by rbinom(19, 1, 0.5)
## W = 30, p-value = 0.4
## alternative hypothesis: true location shift is not equal to 0
# where paracou$Circonference is numeric and rbinom(19,1,0.5) is a binary factor

4.5.2 Independent 2-group Mann-Whitney U Test (two numeric vectors)

wilcox.test(paracou$Circonference[1:9],paracou$Circonference[11:19])
## Warning in wilcox.test.default(paracou$Circonference[1:9], paracou
## $Circonference[11:19]): cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  paracou$Circonference[1:9] and paracou$Circonference[11:19]
## W = 40, p-value = 0.9
## alternative hypothesis: true location shift is not equal to 0
# where both arguments are numeric vectors

4.5.3 Dependent 2-group Wilcoxon Signed Rank Test

wilcox.test(paracou$Circonference[1:9],paracou$Circonference[11:19],paired=TRUE)
## Warning in wilcox.test.default(paracou$Circonference[1:9], paracou
## $Circonference[11:19], : cannot compute exact p-value with ties
## Warning in wilcox.test.default(paracou$Circonference[1:9], paracou
## $Circonference[11:19], : cannot compute exact p-value with zeroes
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  paracou$Circonference[1:9] and paracou$Circonference[11:19]
## V = 20, p-value = 0.9
## alternative hypothesis: true location shift is not equal to 0
# where both arguments are numeric vectors

4.5.4 Kruskal-Wallis Test (One-Way ANOVA by Ranks)

kruskal.test(mpg ~ gear, data=mtcars) 
## 
##  Kruskal-Wallis rank sum test
## 
## data:  mpg by gear
## Kruskal-Wallis chi-squared = 10, df = 2, p-value = 8e-04
# where y is numeric and x is a factor

For wilcox.test( ) you can use the alternative="less" or alternative="greater" option to specify a one-tailed test.

Parametric and resampling alternatives are available.

4.6 Multiple Regression

R provides comprehensive support for multiple linear regression. The topics below are provided in order of increasing complexity.

4.6.1 Fitting the Model

Multiple Linear Regression Example

fit <- lm(Circonference ~ X + Y + NumArbre, data=paracou)
summary(fit) # show results
## 
## Call:
## lm(formula = Circonference ~ X + Y + NumArbre, data = paracou)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.249 -1.314 -0.465  1.398  3.365 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -219.3771   283.2475   -0.77     0.45
## X             -0.0182     0.0120   -1.51     0.15
## Y             -0.0112     0.0199   -0.57     0.58
## NumArbre       0.0864     0.0962    0.90     0.38
## 
## Residual standard error: 1.88 on 15 degrees of freedom
## Multiple R-squared:  0.168,  Adjusted R-squared:  0.00215 
## F-statistic: 1.01 on 3 and 15 DF,  p-value: 0.414

Other useful functions

coefficients(fit)           # model coefficients
## (Intercept)           X           Y    NumArbre 
##   -219.3771     -0.0182     -0.0112      0.0864
confint(fit, level=0.95)    # CIs for model parameters 
##                 2.5 %   97.5 %
## (Intercept) -823.1050 3.84e+02
## X             -0.0438 7.45e-03
## Y             -0.0536 3.11e-02
## NumArbre      -0.1186 2.91e-01
fitted(fit)                 # predicted values
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 35.0 35.1 34.2 34.6 33.8 34.0 33.8 33.2 32.9 32.9 33.5 34.3 33.5 32.0 34.6 
##   16   17   18   19 
## 34.2 34.3 34.1 34.0
residuals(fit)              # residuals
##       1       2       3       4       5       6       7       8       9 
##  1.0465 -1.1168 -2.2492  3.3655 -1.3026 -0.4646 -1.3263  1.2678  0.0791 
##      10      11      12      13      14      15      16      17      18 
## -0.8847 -0.9792  1.6668  1.0293  1.5289  2.3893 -2.1566 -1.7810  1.8901 
##      19 
## -2.0024
anova(fit)                  # anova table 
vcov(fit)                   # covariance matrix for model parameters 
##             (Intercept)         X         Y  NumArbre
## (Intercept)   80229.173  2.07e-01  3.11e+00 -2.72e+01
## X                 0.207  1.45e-04 -2.43e-05 -7.11e-05
## Y                 3.112 -2.43e-05  3.94e-04 -1.07e-03
## NumArbre        -27.244 -7.11e-05 -1.07e-03  9.25e-03
influence(fit)              # regression diagnostics
## $hat
##      1      2      3      4      5      6      7      8      9     10 
## 0.2470 0.2530 0.1529 0.1424 0.1187 0.1509 0.1714 0.1396 0.1906 0.2172 
##     11     12     13     14     15     16     17     18     19 
## 0.3536 0.2376 0.0802 0.4502 0.2135 0.1424 0.2107 0.3122 0.2159 
## 
## $coefficients
##    (Intercept)         X         Y  NumArbre
## 1      -77.833 -0.001882 -9.41e-04  2.64e-02
## 2       75.938  0.002738  7.64e-04 -2.57e-02
## 3      122.275 -0.000412  3.33e-03 -4.15e-02
## 4     -148.189 -0.003792 -1.90e-03  5.02e-02
## 5       19.669 -0.000272 -2.22e-03 -6.53e-03
## 6        0.273  0.000317 -1.48e-03 -2.08e-05
## 7      -15.853  0.000864 -5.35e-03  5.62e-03
## 8       17.170  0.001344  3.55e-03 -6.00e-03
## 9       -0.387  0.000230 -6.52e-05  1.32e-04
## 10       4.063 -0.002865  1.50e-03 -1.43e-03
## 11      42.477 -0.003280  7.38e-03 -1.47e-02
## 12     -54.785  0.000294 -9.72e-03  1.91e-02
## 13      18.791  0.000777 -1.11e-04 -6.33e-03
## 14     174.637  0.007248  1.19e-02 -5.97e-02
## 15      -0.911 -0.003466 -8.73e-03  9.33e-04
## 16     -70.399  0.002599  1.20e-04  2.36e-02
## 17     -89.494  0.003756 -2.19e-03  3.02e-02
## 18     163.381 -0.005232  7.87e-03 -5.54e-02
## 19     -97.342  0.001857  2.13e-03  3.26e-02
## 
## $sigma
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1.92 1.92 1.83 1.69 1.91 1.94 1.91 1.91 1.95 1.93 1.92 1.88 1.93 1.87 1.81 
##   16   17   18   19 
## 1.84 1.87 1.85 1.85 
## 
## $wt.res
##       1       2       3       4       5       6       7       8       9 
##  1.0465 -1.1168 -2.2492  3.3655 -1.3026 -0.4646 -1.3263  1.2678  0.0791 
##      10      11      12      13      14      15      16      17      18 
## -0.8847 -0.9792  1.6668  1.0293  1.5289  2.3893 -2.1566 -1.7810  1.8901 
##      19 
## -2.0024

4.6.2 Diagnostic plots

Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page 
plot(fit)

4.6.3 Comparing Models

You can compare nested models with the anova( ) function.

# compare models
fit1 <- lm(Circonference ~ X + Y + NumArbre, data=paracou)
fit2 <- lm(Circonference ~ X + Y , data=paracou)
anova(fit1, fit2)

4.6.4 Cross Validation

You can do K-fold cross-validation using the CVlm( ) function in the DAAG package.

library(DAAG)
CVlm(data=nihills, form.lm=formula(log(time)~log(climb)+log(dist)),
          plotit="Observed")
## Analysis of Variance Table
## 
## Response: log(time)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## log(climb)  1   5.94    5.94    1013 < 2e-16 ***
## log(dist)   1   0.89    0.89     152 8.2e-11 ***
## Residuals  20   0.12    0.01                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
## fold 1 
## Observations in test set: 7 
##             Binevenagh Slieve Gullion McVeigh Classic Slieve Martin
## Predicted      -0.1129        -0.7174         -0.5623       -0.5310
## cvpred         -0.0735        -0.6986         -0.5336       -0.5337
## log(time)      -0.1528        -0.7621         -0.6141       -0.5968
## CV residual    -0.0793        -0.0636         -0.0805       -0.0631
##             Hen & Cock Slieve Bearnagh Lurig Challenge
## Predicted       -0.931        -0.33834         -0.7992
## cvpred          -0.975        -0.38052         -0.7779
## log(time)       -0.799        -0.37429         -0.8330
## CV residual      0.176         0.00623         -0.0551
## 
## Sum of squares = 0.05    Mean square = 0.01    n = 7 
## 
## fold 2 
## Observations in test set: 8 
##             Donard & Commedagh Annalong Horseshoe Monument Race   Rocky
## Predicted               0.1184             0.7063       -0.7992 -0.6770
## cvpred                  0.1319             0.7327       -0.8194 -0.6900
## log(time)               0.0379             0.6674       -0.7515 -0.6481
## CV residual            -0.0940            -0.0653        0.0679  0.0419
##             Slieve Donard Flagstaff to Carling Slieve Gallion
## Predicted         -0.1044               0.4017        -0.5342
## cvpred            -0.0958               0.4135        -0.5441
## log(time)         -0.0530               0.3763        -0.4524
## CV residual        0.0428              -0.0372         0.0917
##             BARF Turkey Trot
## Predicted            -0.3913
## cvpred               -0.4010
## log(time)            -0.3382
## CV residual           0.0628
## 
## Sum of squares = 0.04    Mean square = 0    n = 8 
## 
## fold 3 
## Observations in test set: 8 
##             Glenariff Mountain Tollymore Mountain Moughanmore
## Predicted               -0.446            -0.6989    -0.80639
## cvpred                  -0.462            -0.7020    -0.76628
## log(time)               -0.352            -0.7270    -0.76871
## CV residual              0.110            -0.0251    -0.00243
##             Loughshannagh Horseshoe Meelbeg Meelmore Donard Forest
## Predicted                   -0.5028           -0.616        -0.562
## cvpred                      -0.4897           -0.586        -0.555
## log(time)                   -0.4355           -0.789        -0.657
## CV residual                  0.0542           -0.202        -0.101
##             Seven Sevens Scrabo Hill Race
## Predicted          1.270         -1.15233
## cvpred             1.187         -1.11845
## log(time)          1.362         -1.12479
## CV residual        0.175         -0.00634
## 
## Sum of squares = 0.1    Mean square = 0.01    n = 8 
## 
## Overall (Sum over all 8 folds) 
##      ms 
## 0.00815

Sum the MSE for each fold, divide by the number of observations, and take the square root to get the cross-validated standard error of estimate.
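The overall ms printed above is already the summed SS divided by the number of observations, so here that is simply:

sqrt(0.00815)  # cross-validated standard error of estimate
## [1] 0.0903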

You can assess R2 shrinkage via K-fold cross-validation using the crossval( ) function from the bootstrap package.
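A sketch for the paracou model fit above (the crossval( ) interface expects fit and predict helper functions, here built on lsfit( ); ngroup sets the number of folds):

# Assessing R2 shrinkage using 10-fold cross-validation
library(bootstrap)
theta.fit <- function(x,y){lsfit(x,y)}
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}

X <- as.matrix(paracou[c("X","Y","NumArbre")])
y <- paracou$Circonference

results <- crossval(X, y, theta.fit, theta.predict, ngroup=10)
cor(y, fit$fitted.values)**2  # raw R2
cor(y, results$cv.fit)**2     # cross-validated R2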

4.6.5 Variable Selection

Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial topic. You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from the MASS package. stepAIC( ) performs stepwise model selection by exact AIC.

# Stepwise Regression
step <- stepAIC(fit, direction="both")
## Start:  AIC=27.5
## Circonference ~ X + Y + NumArbre
## 
##            Df Sum of Sq  RSS  AIC
## - Y         1      1.13 54.2 25.9
## - NumArbre  1      2.85 55.9 26.5
## <none>                  53.0 27.5
## - X         1      8.08 61.1 28.2
## 
## Step:  AIC=25.9
## Circonference ~ X + NumArbre
## 
##            Df Sum of Sq  RSS  AIC
## - NumArbre  1      1.74 55.9 24.5
## <none>                  54.2 25.9
## - X         1      8.80 63.0 26.8
## + Y         1      1.13 53.0 27.5
## 
## Step:  AIC=24.5
## Circonference ~ X
## 
##            Df Sum of Sq  RSS  AIC
## <none>                  55.9 24.5
## - X         1      7.87 63.8 25.0
## + NumArbre  1      1.74 54.2 25.9
## + Y         1      0.02 55.9 26.5
step$anova # display results

Alternatively, you can perform all-subsets regression using the leaps( ) function from the leaps package.
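A sketch with regsubsets( ) from the same leaps package (the formula interface; nbest= controls how many models of each size are kept):

# All-subsets regression on the paracou model
library(leaps)
subs <- regsubsets(Circonference ~ X + Y + NumArbre, data=paracou, nbest=10)
summary(subs)           # best models of each size
plot(subs, scale="r2")  # plot a table of models ordered by R2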

4.6.6 Relative Importance

The relaimpo package provides measures of relative importance for each of the predictors in the model.

library(relaimpo)
# Calculate Relative Importance for Each Predictor
calc.relimp(fit,type=c("lmg","last","first","pratt"),
   rela=TRUE)
## Response variable: Circonference 
## Total response variance: 3.54 
## Analysis based on 19 observations 
## 
## 3 Regressors: 
## X Y NumArbre 
## Proportion of variance explained by model: 16.9%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##             lmg  last  first  pratt
## X        0.7482 0.670 0.8703 0.7540
## Y        0.0755 0.094 0.0401 0.0731
## NumArbre 0.1763 0.236 0.0896 0.1729
## 
## Average coefficients for different model sizes: 
## 
##                1X      2Xs     3Xs
## X        -0.01766 -0.01819 -0.0182
## Y        -0.00519 -0.00777 -0.0112
## NumArbre  0.03772  0.06664  0.0864
# Bootstrap Measures of Relative Importance (!! 100 samples) 
boot <- boot.relimp(fit, b = 100, type = c("lmg", 
  "last", "first", "pratt"), rank = TRUE, 
  diff = TRUE, rela = TRUE)
booteval.relimp(boot) # print result
## Response variable: Circonference 
## Total response variance: 3.54 
## Analysis based on 19 observations 
## 
## 3 Regressors: 
## X Y NumArbre 
## Proportion of variance explained by model: 16.9%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##             lmg  last  first  pratt
## X        0.7482 0.670 0.8703 0.7540
## Y        0.0755 0.094 0.0401 0.0731
## NumArbre 0.1763 0.236 0.0896 0.1729
## 
## Average coefficients for different model sizes: 
## 
##                1X      2Xs     3Xs
## X        -0.01766 -0.01819 -0.0182
## Y        -0.00519 -0.00777 -0.0112
## NumArbre  0.03772  0.06664  0.0864
## 
##  
##  Confidence interval information ( 100 bootstrap replicates, bty= perc ): 
## Relative Contributions with confidence intervals: 
##  
##                               Lower  Upper
##                percentage 0.95 0.95    0.95   
## X.lmg           0.7482    ABC   0.0254  0.9590
## Y.lmg           0.0755    ABC   0.0144  0.6240
## NumArbre.lmg    0.1763    ABC   0.0068  0.8280
##                                               
## X.last          0.6700    ABC   0.0015  0.9680
## Y.last          0.0940    ABC   0.0002  0.6900
## NumArbre.last   0.2360    ABC   0.0063  0.9520
##                                               
## X.first         0.8703    ABC   0.0046  0.9910
## Y.first         0.0401    ABC   0.0002  0.6660
## NumArbre.first  0.0896    ABC   0.0001  0.8870
##                                               
## X.pratt         0.7540    ABC  -0.2120  1.0150
## Y.pratt         0.0731    ABC  -0.1400  0.8290
## NumArbre.pratt  0.1729    ABC  -0.1300  1.0700
## 
## Letters indicate the ranks covered by bootstrap CIs. 
## (Rank bootstrap confidence intervals always obtained by percentile method) 
## CAUTION: Bootstrap confidence intervals can be somewhat liberal. 
## 
##  
##  Differences between Relative Contributions: 
##  
##                                  Lower   Upper
##                  difference 0.95 0.95    0.95   
## X-Y.lmg           0.6727         -0.5386  0.9374
## X-NumArbre.lmg    0.5718         -0.7951  0.9524
## Y-NumArbre.lmg   -0.1008         -0.7251  0.4180
##                                                 
## X-Y.last          0.5756         -0.6131  0.9610
## X-NumArbre.last   0.4332         -0.9397  0.9499
## Y-NumArbre.last  -0.1424         -0.9157  0.5406
##                                                 
## X-Y.first         0.8302         -0.5933  0.9863
## X-NumArbre.first  0.7806         -0.8572  0.9871
## Y-NumArbre.first -0.0496         -0.8419  0.5868
##                                                 
## X-Y.pratt         0.6809         -0.8180  1.0218
## X-NumArbre.pratt  0.5811         -1.1032  1.0431
## Y-NumArbre.pratt -0.0997         -1.1677  0.8540
## 
## * indicates that CI for difference does not include 0. 
## CAUTION: Bootstrap confidence intervals can be somewhat liberal.
plot(booteval.relimp(boot,sort=TRUE)) # plot result

4.7 Regression Diagnostics

An excellent review of regression diagnostics is provided in the vignettes of the car package, which provides advanced utilities for regression modeling.

Assume that we are fitting a multiple linear regression on the mtcars data.

library(car)
fit <- lm(mpg~disp+hp+wt+drat, data=mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ disp + hp + wt + drat, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.508 -1.905 -0.506  0.982  5.688 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.14874    6.29359    4.63  8.2e-05 ***
## disp         0.00382    0.01080    0.35   0.7268    
## hp          -0.03478    0.01160   -3.00   0.0058 ** 
## wt          -3.47967    1.07837   -3.23   0.0033 ** 
## drat         1.76805    1.31978    1.34   0.1915    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.6 on 27 degrees of freedom
## Multiple R-squared:  0.838,  Adjusted R-squared:  0.814 
## F-statistic: 34.8 on 4 and 27 DF,  p-value: 2.7e-10

This example is for exposition only. We will ignore the fact that this may not be a great way of modeling this particular set of data!

4.7.1 Outliers

# Assessing Outliers
outlierTest(fit)                # Bonferroni p-value for most extreme obs
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##                rstudent unadjusted p-value Bonferonni p
## Toyota Corolla     2.52             0.0184        0.588
qqPlot(fit, main="QQ Plot")     # qq plot for studentized resid 

leveragePlots(fit)              # leverage plots

4.7.2 Influential Observations

# Influential Observations
# added variable plots 
avPlots(fit)   # av.plots( ) in older versions of car

# Cook's D plot
# identify D values > 4/(n-k-1) 
cutoff <- 4/((nrow(mtcars)-length(fit$coefficients)-2)) 
plot(fit, which=4, cook.levels=cutoff)

# Influence Plot 
influencePlot(fit,  id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance" )

4.7.3 Non-normality

# Normality of Residuals, qq plot for studentized resid
qqPlot(fit, main="QQ Plot")

# distribution of studentized residuals
library(MASS)          # studres( ) comes from MASS
sresid <- studres(fit) 
hist(sresid, freq=FALSE, main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40) 
yfit<-dnorm(xfit) 
lines(xfit, yfit)

4.7.4 Non-constant Error Variance

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.43    Df = 1     p = 0.232
# plot studentized residuals vs. fitted values 
spreadLevelPlot(fit)

## 
## Suggested power transformation:  0.662

4.7.5 Multi-collinearity

# Evaluate Collinearity
vif(fit) # variance inflation factors 
## disp   hp   wt drat 
## 8.21 2.89 5.10 2.28
sqrt(vif(fit)) > 2 # problem?
##  disp    hp    wt  drat 
##  TRUE FALSE  TRUE FALSE
# Evaluate Nonlinearity
# component + residual plot 
crPlots(fit)

# Ceres plots 
ceresPlots(fit)

Ceres plots are a generalization of component+residual (partial residual) plots that are less prone to leakage of nonlinearity among the predictors.

4.7.6 Non-independence of Errors

# Test for Autocorrelated Errors
durbinWatsonTest(fit)
##  lag Autocorrelation D-W Statistic p-value
##    1           0.101          1.74   0.308
##  Alternative hypothesis: rho != 0

4.7.7 Additional Diagnostic Help

The gvlma( ) function in the gvlma package performs a global validation of linear model assumptions as well as separate evaluations of skewness, kurtosis, and heteroscedasticity.

# Global test of model assumptions
library(gvlma)
gvmodel <- gvlma(fit) 
summary(gvmodel)
## 
## Call:
## lm(formula = mpg ~ disp + hp + wt + drat, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.508 -1.905 -0.506  0.982  5.688 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.14874    6.29359    4.63  8.2e-05 ***
## disp         0.00382    0.01080    0.35   0.7268    
## hp          -0.03478    0.01160   -3.00   0.0058 ** 
## wt          -3.47967    1.07837   -3.23   0.0033 ** 
## drat         1.76805    1.31978    1.34   0.1915    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.6 on 27 degrees of freedom
## Multiple R-squared:  0.838,  Adjusted R-squared:  0.814 
## F-statistic: 34.8 on 4 and 27 DF,  p-value: 2.7e-10
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = fit) 
## 
##                      Value p-value                   Decision
## Global Stat        13.9382 0.00750 Assumptions NOT satisfied!
## Skewness            4.3131 0.03782 Assumptions NOT satisfied!
## Kurtosis            0.0138 0.90654    Assumptions acceptable.
## Link Function       8.7166 0.00315 Assumptions NOT satisfied!
## Heteroscedasticity  0.8947 0.34421    Assumptions acceptable.

4.8 ANOVA

If you have been analyzing ANOVA designs in traditional statistical packages, you are likely to find R’s approach less coherent and user-friendly. A good online presentation on ANOVA in R can be found in the ANOVA section of the Personality Project. (Note: I have found that these pages render fine in the Chrome and Safari browsers, but can appear distorted in Internet Explorer.)

4.8.1 Fit a Model

4.8.1.1 One Way Anova (Completely Randomized Design)

str(npk)
## 'data.frame':    24 obs. of  5 variables:
##  $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
##  $ P    : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
##  $ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
##  $ yield: num  49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
summary(aov(yield ~ N, npk))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## N            1    189   189.3    6.06  0.022 *
## Residuals   22    687    31.2                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.8.1.2 Randomized Block Design (block is the blocking factor)

str(npk)
## 'data.frame':    24 obs. of  5 variables:
##  $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
##  $ P    : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
##  $ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
##  $ yield: num  49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
summary(aov(yield ~ N + block, npk))
##             Df Sum Sq Mean Sq F value Pr(>F)   
## N            1    189   189.3    9.36 0.0071 **
## block        5    343    68.7    3.40 0.0262 * 
## Residuals   17    344    20.2                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.8.1.3 Two Way Factorial Design

summary(aov(yield ~ N + P + N:P, npk))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## N            1    189   189.3    5.76  0.026 *
## P            1      8     8.4    0.26  0.619  
## N:P          1     21    21.3    0.65  0.430  
## Residuals   20    657    32.9                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(yield ~ N*P, npk))          # same thing
##             Df Sum Sq Mean Sq F value Pr(>F)  
## N            1    189   189.3    5.76  0.026 *
## P            1      8     8.4    0.26  0.619  
## N:P          1     21    21.3    0.65  0.430  
## Residuals   20    657    32.9                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.8.1.4 Analysis of Covariance

summary(aov(yield ~ N + rnorm(24, 0, 1), npk))
##                 Df Sum Sq Mean Sq F value Pr(>F)  
## N                1    189   189.3    6.25  0.021 *
## rnorm(24, 0, 1)  1     51    50.8    1.68  0.209  
## Residuals       21    636    30.3                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.8.2 Look at Diagnostic Plots

Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.

layout(matrix(c(1,2,3,4),2,2))    # optional layout 
plot(aov(yield ~ N, npk))         # diagnostic plots

4.8.3 Evaluate Model Effects

summary(aov(yield ~ N + P + N:P, npk)) # display Type I ANOVA table
##             Df Sum Sq Mean Sq F value Pr(>F)  
## N            1    189   189.3    5.76  0.026 *
## P            1      8     8.4    0.26  0.619  
## N:P          1     21    21.3    0.65  0.430  
## Residuals   20    657    32.9                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(aov(yield ~ N + P + N:P, npk),~.,test="F") # type III SS and F Tests

Nonparametric and resampling alternatives are available.

4.8.4 Multiple Comparisons

You can get Tukey HSD tests using the function below. By default, it calculates post hoc comparisons on each factor in the model. You can specify specific factors as an option.

# Tukey Honestly Significant Differences
TukeyHSD(aov(yield ~ N + P + N:P, npk)) # where fit comes from aov()
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = yield ~ N + P + N:P, data = npk)
## 
## $N
##     diff   lwr  upr p adj
## 1-0 5.62 0.734 10.5 0.026
## 
## $P
##      diff   lwr upr p adj
## 1-0 -1.18 -6.07 3.7 0.619
## 
## $`N:P`
##          diff    lwr   upr p adj
## 1:0-0:0  7.50  -1.76 16.76 0.140
## 0:1-0:0  0.70  -8.56  9.96 0.997
## 1:1-0:0  4.43  -4.83 13.70 0.550
## 0:1-1:0 -6.80 -16.06  2.46 0.202
## 1:1-1:0 -3.07 -12.33  6.20 0.791
## 1:1-0:1  3.73  -5.53 13.00 0.677

4.8.5 Visualizing Results

# Two-way Interaction Plot 
attach(mtcars)
gear <- factor(gear)
cyl <- factor(cyl)
interaction.plot(cyl, gear, mpg, type="b", col=c(1:3), 
    leg.bty="o", leg.bg="beige", lwd=2, pch=c(18,24,22),    
   xlab="Number of Cylinders", 
   ylab="Mean Miles Per Gallon", 
   main="Interaction Plot")

# Plot Means with Error Bars
library(gplots)
attach(mtcars)
cyl <- factor(cyl)
plotmeans(mpg~cyl,xlab="Number of Cylinders",
  ylab="Miles Per Gallon", main="Mean Plot\nwith 95% CI")

5 Graphs

One of the main reasons data analysts turn to R is for its strong graphic capabilities.

Creating a Graph provides an overview of creating and saving graphs in R.

The remainder of the section describes how to create basic graph types. These include density plots (histograms and kernel density plots), dot plots, bar charts (simple, stacked, grouped), line charts, pie charts (simple, annotated, 3D), boxplots (simple, notched, violin plots, bagplots), and scatterplots (simple, with fit lines, scatterplot matrices, high density plots, and 3D plots).

The Advanced Graphs section describes how to customize and annotate graphs, and covers more statistically complex types of graphs.

5.1 Creating a Graph

In R, graphs are typically created interactively.

attach(mtcars)
plot(wt, mpg) 
abline(lm(mpg~wt))
title("Regression of MPG on Weight")

The plot( ) function opens a graph window and plots weight vs. miles per gallon. The next line of code adds a regression line to this graph. The final line adds a title.

5.1.1 Saving Graphs

You can save the graph in a variety of formats from the menu File -> Save As.

You can also save the graph via code using one of the following functions.

Function Output to
pdf("mygraph.pdf") pdf file
win.metafile("mygraph.wmf") windows metafile
png("mygraph.png") png file
jpeg("mygraph.jpg") jpeg file
bmp("mygraph.bmp") bmp file
postscript("mygraph.ps") postscript file
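Whichever device function you use, call dev.off( ) afterwards to close the device and write the file. For example:

pdf("mygraph.pdf")           # open the pdf device
plot(mtcars$wt, mtcars$mpg)  # drawing goes to the file instead of the screen
dev.off()                    # close the device; mygraph.pdf is written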

5.1.2 Viewing Several Graphs

Creating a new graph by issuing a high level plotting command (plot, hist, boxplot, etc.) will typically overwrite a previous graph. To avoid this, open a new graph window before creating a new graph. To open a new graph window use one of the functions below.

Function Platform
windows() Windows
X11() Unix
quartz() Mac

You can have multiple graph windows open at one time. See help(dev.cur) for more details.
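The platform-independent dev.new( ) also works. For example:

dev.new()                    # open an additional graph window
plot(mtcars$wt, mtcars$mpg)
dev.cur()                    # identify the currently active device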

5.1.3 Graphical Parameters

You can specify fonts, colors, line styles, axes, reference lines, etc. by specifying graphical parameters. This allows a wide degree of customization. Graphical parameters are covered in the Advanced Graphs section, which also includes more detailed coverage of axis and text customization.

5.2 Histograms and Density Plots

5.2.1 Histograms

You can create histograms with the function hist(x), where x is a numeric vector of values to be plotted. The option freq=FALSE plots probability densities instead of frequencies. The option breaks= controls the number of bins.

Simple Histogram

hist(mtcars$mpg)

5.2.1.1 Colored Histogram with Different Number of Bins

hist(mtcars$mpg, breaks=12, col="red")

5.2.1.2 Add a Normal Curve

x <- mtcars$mpg 
h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon", 
    main="Histogram with Normal Curve") 
xfit<-seq(min(x),max(x),length=40) 
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="blue", lwd=2)

5.2.1.3 Pay attention

Histograms can be a poor method for determining the shape of a distribution because they are so strongly affected by the number of bins used.

par(mfrow=c(1,2))
hist(mtcars$mpg, breaks=3, col="red")
hist(mtcars$mpg, breaks=6, col="red")

5.2.2 Kernel Density Plots

Kernel density plots are usually a much more effective way to view the distribution of a variable. Create the plot using plot(density(x)), where x is a numeric vector.

5.2.2.1 Simple Plot

d <- density(mtcars$mpg) # returns the density data 
plot(d) # plots the results

5.2.2.2 Filled Density Plot

plot(d, main="Kernel Density of Miles Per Gallon")
polygon(d, col="red", border="blue")

5.2.2.3 Comparing Groups via Kernel Density

The sm.density.compare( ) function in the sm package allows you to superimpose the kernel density plots of two or more groups. The format is sm.density.compare(x, factor), where x is a numeric vector and factor is the grouping variable.

# Compare MPG distributions for cars with 4, 6, or 8 cylinders
library(sm)
attach(mtcars)
cyl.f <- factor(cyl, levels= c(4,6,8),
  labels = c("4 cylinder", "6 cylinder", "8 cylinder")) 
sm.density.compare(mpg, cyl.f, xlab="Miles Per Gallon")
title(main="MPG Distribution by Car Cylinders")

5.3 Dot Plots

Create dotplots with the dotchart(x, labels=) function, where x is a numeric vector and labels is a vector of labels for each point. You can add a groups= option to designate a factor specifying how the elements of x are grouped. If so, the option gcolor= controls the color of the groups and label.cex controls the size of the labels.

5.3.1 Simple Dotplot

dotchart(mtcars$mpg, labels=row.names(mtcars), cex=.7,
    main="Gas Mileage for Car Models", 
   xlab="Miles Per Gallon")

5.3.2 Dotplot: Grouped, Sorted, and Colored

Sort by mpg, group and color by cylinder

x <- mtcars[order(mtcars$mpg),] # sort by mpg
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"    
dotchart(x$mpg, labels=row.names(x), cex=.7, groups=x$cyl,
    main="Gas Mileage for Car Models\ngrouped by cylinder",
   xlab="Miles Per Gallon", gcolor="black", color=x$color)

5.4 Bar Plots

Create barplots with the barplot(height) function, where height is a vector or matrix. If height is a vector, the values determine the heights of the bars in the plot. If height is a matrix and the option beside=FALSE, then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked "sub-bars". If height is a matrix and beside=TRUE, then the values in each column are juxtaposed rather than stacked. Include the option names.arg=(character vector) to label the bars. Add the option horiz=TRUE to create a horizontal barplot.

5.4.1 Simple Vertical Bar Plot

counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", 
    xlab="Number of Gears")

5.4.2 Simple Horizontal Bar Plot

counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE,
  names.arg=c("3 Gears", "4 Gears", "5 Gears"))

5.4.3 Stacked Bar Plot

counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
  xlab="Number of Gears", col=c("darkblue","red"),
    legend = rownames(counts))

5.4.4 Grouped Bar Plot

counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS",
  xlab="Number of Gears", col=c("darkblue","red"),
    legend = rownames(counts), beside=TRUE)

5.4.5 Customizing

Bar plots need not be based on counts or frequencies. You can create bar plots that represent means, medians, standard deviations, etc. Use the aggregate( ) function and pass the results to the barplot( ) function. By default, the categorical axis line is suppressed. Include the option axis.lty=1 to draw it. With many bars, bar labels may start to overlap. You can decrease the font size using the cex.names option. Values smaller than one will shrink the size of the label. Additionally, you can use graphical parameters such as the following to help text spacing:

par(las=2) # make label text perpendicular to axis
par(mar=c(5,8,4,2)) # increase y-axis margin.
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", horiz=TRUE, names.arg=c("3 Gears", "4 Gears", "5 Gears"), cex.names=0.8)
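The aggregate( ) approach mentioned above can be sketched as follows: compute group means with aggregate( ) and pass them to barplot( ), so the bars represent means rather than counts (the grouping by gear here is just an illustrative choice):

```r
# Bar plot of mean mpg by number of gears (means, not counts)
means <- aggregate(mpg ~ gear, data=mtcars, FUN=mean)
barplot(means$mpg, names.arg=means$gear, axis.lty=1,
    main="Mean MPG by Number of Gears", xlab="Number of Gears")
```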

5.5 Line Charts

Line charts are created with the function lines(x, y, type=) where x and y are numeric vectors of (x,y) points to connect. type= can take the following values:

Type Description
p points
l lines
o overplotted points and lines
b, c points (empty if “c”) joined by lines
s, S stair steps
h histogram-like vertical lines
n does not produce any points or lines

The lines( ) function adds information to a graph. It cannot produce a graph on its own. Usually it follows a plot(x, y) command that produces a graph.

By default, plot( ) plots the (x,y) points. Use the type=“n” option in the plot( ) command, to create the graph with axes, titles, etc., but without plotting the points.

Example: In the following code each of the type= options is applied to the same dataset. The plot( ) command sets up the graph, but does not plot the points.

x <- c(1:5); y <- x # create some data 
par(pch=22, col="red") # plotting symbol and color 
par(mfrow=c(2,4)) # all plots on one page 
opts = c("p","l","o","b","c","s","S","h") 
for(i in 1:length(opts)){ 
  heading = paste("type=",opts[i]) 
  plot(x, y, type="n", main=heading) 
  lines(x, y, type=opts[i]) 
}

Next, we demonstrate each of the type= options when plot( ) sets up the graph and does plot the points.

x <- c(1:5); y <- x # create some data
par(pch=22, col="blue") # plotting symbol and color 
par(mfrow=c(2,4)) # all plots on one page 
opts = c("p","l","o","b","c","s","S","h") 
for(i in 1:length(opts)){ 
  heading = paste("type=",opts[i]) 
  plot(x, y, main=heading) 
  lines(x, y, type=opts[i]) 
}

As you can see, the type=“c” option only looks different from the type=“b” option if the plotting of points is suppressed in the plot( ) command.

To demonstrate the creation of a more complex line chart, let’s plot the growth of 5 orange trees over time. Each tree will have its own distinctive line. The data come from the dataset Orange.

# convert factor to numeric for convenience 
Orange$Tree <- as.numeric(Orange$Tree) 
ntrees <- max(Orange$Tree)

# get the range for the x and y axis 
xrange <- range(Orange$age) 
yrange <- range(Orange$circumference) 

# set up the plot 
plot(xrange, yrange, type="n", xlab="Age (days)",
    ylab="Circumference (mm)" ) 
colors <- rainbow(ntrees) 
linetype <- c(1:ntrees) 
plotchar <- seq(18,18+ntrees,1)

# add lines 
for (i in 1:ntrees) { 
  tree <- subset(Orange, Tree==i) 
  lines(tree$age, tree$circumference, type="b", lwd=1.5,
    lty=linetype[i], col=colors[i], pch=plotchar[i]) 
}

5.6 Pie Chart

Pie charts are not recommended in the R documentation, and their features are somewhat limited. The authors recommend bar or dot plots over pie charts because people are able to judge length more accurately than volume. Pie charts are created with the function pie(x, labels=) where x is a non-negative numeric vector indicating the area of each slice and labels= notes a character vector of names for the slices.

5.6.1 Simple Pie Chart

slices <- c(10, 12,4, 16, 8)
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie(slices, labels = lbls, main="Pie Chart of Countries")

5.6.2 Pie Chart with Percentages

slices <- c(10, 12, 4, 16, 8) 
lbls <- c("US", "UK", "Australia", "Germany", "France")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels 
lbls <- paste(lbls,"%",sep="") # add % to labels 
pie(slices,labels = lbls, col=rainbow(length(lbls)),
    main="Pie Chart of Countries")

5.6.3 3D Pie Chart

library(plotrix)
slices <- c(10, 12, 4, 16, 8) 
lbls <- c("US", "UK", "Australia", "Germany", "France")
pie3D(slices, labels=lbls, explode=0.1,
    main="Pie Chart of Countries")

Note that you can experiment with the explode argument!

5.7 Boxplots

Boxplots can be created for individual variables or for variables by group. The format is boxplot(x, data=), where x is a formula and data= denotes the data frame providing the data. An example of a formula is y~group, where a separate boxplot for the numeric variable y is generated for each value of group. Add varwidth=TRUE to make boxplot widths proportional to the square root of the sample sizes. Add horizontal=TRUE to reverse the axis orientation.

5.7.1 Boxplots

# Boxplot of MPG by Car Cylinders 
boxplot(mpg~cyl,data=mtcars, main="Car Milage Data", 
    xlab="Number of Cylinders", ylab="Miles Per Gallon")

# Notched Boxplot of Tooth Growth Against 2 Crossed Factors
# boxes colored for ease of interpretation 
boxplot(len~supp*dose, data=ToothGrowth, notch=TRUE, 
  col=(c("gold","darkgreen")),
  main="Tooth Growth", xlab="Supplement and Dose")

5.7.2 Violin Plots

A violin plot is a combination of a boxplot and a kernel density plot. They can be created using the vioplot( ) function from the vioplot package.

library(vioplot)
x1 <- mtcars$mpg[mtcars$cyl==4]
x2 <- mtcars$mpg[mtcars$cyl==6]
x3 <- mtcars$mpg[mtcars$cyl==8]
vioplot(x1, x2, x3, names=c("4 cyl", "6 cyl", "8 cyl"), 
   col="gold")
title("Violin Plots of Miles Per Gallon")

5.8 Scatterplots

5.8.1 Simple Scatterplot

There are many ways to create a scatterplot in R. The basic function is plot(x, y), where x and y are numeric vectors denoting the (x,y) points to plot.

attach(mtcars)
plot(wt, mpg, main="Scatterplot Example", 
    xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)
# Add fit lines
abline(lm(mpg~wt), col="red") # regression line (y~x) 
lines(lowess(wt,mpg), col="blue") # lowess line (x,y)

The scatterplot( ) function in the car package offers many enhanced features, including fit lines, marginal box plots, conditioning on a factor, and interactive point identification. Each of these features is optional.

# Enhanced Scatterplot of MPG vs. Weight 
# by Number of Car Cylinders 
library(car) 
scatterplot(mpg ~ wt | cyl, data=mtcars, 
    xlab="Weight of Car", ylab="Miles Per Gallon", 
   main="Enhanced Scatter Plot", 
   labels=row.names(mtcars))

5.8.2 Scatterplot Matrices

There are at least 4 useful functions for creating scatterplot matrices. Analysts must love scatterplot matrices!

5.8.2.1 Basic Scatterplot Matrix

pairs(~mpg+disp+drat+wt,data=mtcars, 
   main="Simple Scatterplot Matrix")

5.8.3 car Scatterplot Matrices

The car package can condition the scatterplot matrix on a factor, and optionally include lowess and linear best fit lines, and boxplot, densities, or histograms in the principal diagonal, as well as rug plots in the margins of the cells.

library(car)
scatterplotMatrix(~mpg+disp+drat+wt|cyl, data=mtcars,
    main="Three Cylinder Options")

5.8.4 gclus Scatterplot Matrices

The gclus package provides options to rearrange the variables so that those with higher correlations are closer to the principal diagonal. It can also color code the cells to reflect the size of the correlations.

library(gclus)
dta <- mtcars[c(1,3,5,6)]       # get data 
dta.r <- abs(cor(dta))          # get correlations
dta.col <- dmat.color(dta.r)    # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r) 
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Variables Ordered and Colored by Correlation" )

5.8.5 High Density Scatterplots

When there are many data points and significant overlap, scatterplots become less useful. There are several approaches that can be used when this occurs. The hexbin(x, y) function in the hexbin package provides bivariate binning into hexagonal cells (it looks better than it sounds).

library(hexbin)
x <- rnorm(10000)
y <- rnorm(10000)
bin<-hexbin(x, y, xbins=50) 
plot(bin, main="Hexagonal Binning")
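Another option for overplotted data, sketched here as an alternative the text alludes to, is the base-graphics function smoothScatter( ), which plots a smoothed two-dimensional kernel density estimate instead of individual points:

```r
# Smoothed density representation of a dense point cloud
x <- rnorm(10000)
y <- rnorm(10000)
smoothScatter(x, y, main="Smoothed Density Scatterplot")
```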

5.8.6 3D Scatterplots

You can create a 3D scatterplot with the scatterplot3d package. Use the function scatterplot3d(x, y, z).

library(scatterplot3d)
attach(mtcars)
scatterplot3d(wt,disp,mpg, main="3D Scatterplot")

with Coloring and Vertical Drop Lines

scatterplot3d(wt,disp,mpg, pch=16, highlight.3d=TRUE,
  type="h", main="3D Scatterplot")

with Coloring and Vertical Lines and Regression Plane

s3d <-scatterplot3d(wt,disp,mpg, pch=16, highlight.3d=TRUE,
  type="h", main="3D Scatterplot")
fit <- lm(mpg ~ wt+disp) 
s3d$plane3d(fit)

6 Multivariate Analyses

6.1 Introduction to vegan

library(vegan)
data(dune)

We study first the lichen pasture data with only continuous constraints. The data sets are taken into use with

data(varespec)
data(varechem)

The environmental data can be inspected using commands str which shows the structure of any object in a compact form, and asking for a summary of a data frame

str(varechem)
## 'data.frame':    24 obs. of  14 variables:
##  $ N       : num  19.8 13.4 20.2 20.6 23.8 22.8 26.6 24.2 29.8 28.1 ...
##  $ P       : num  42.1 39.1 67.7 60.8 54.5 40.9 36.7 31 73.5 40.5 ...
##  $ K       : num  140 167 207 234 181 ...
##  $ Ca      : num  519 357 973 834 777 ...
##  $ Mg      : num  90 70.7 209.1 127.2 125.8 ...
##  $ S       : num  32.3 35.2 58.1 40.7 39.5 40.8 33.8 27.1 42.5 60.2 ...
##  $ Al      : num  39 88.1 138 15.4 24.2 ...
##  $ Fe      : num  40.9 39 35.4 4.4 3 ...
##  $ Mn      : num  58.1 52.4 32.1 132 50.1 ...
##  $ Zn      : num  4.5 5.4 16.8 10.7 6.6 9.1 7.4 5.2 9.3 9.1 ...
##  $ Mo      : num  0.3 0.3 0.8 0.2 0.3 0.4 0.3 0.3 0.3 0.5 ...
##  $ Baresoil: num  43.9 23.6 21.2 18.7 46 40.5 23 29.8 17.6 29.9 ...
##  $ Humdepth: num  2.2 2.2 2 2.9 3 3.8 2.8 2 3 2.2 ...
##  $ pH      : num  2.7 2.8 3 2.8 2.7 2.7 2.8 2.8 2.8 2.8 ...
summary(varechem)
##        N              P              K               Ca      
##  Min.   :13.4   Min.   :22.7   Min.   : 43.6   Min.   : 188  
##  1st Qu.:18.8   1st Qu.:32.6   1st Qu.:127.2   1st Qu.: 426  
##  Median :22.1   Median :41.5   Median :166.6   Median : 518  
##  Mean   :22.4   Mean   :45.1   Mean   :162.9   Mean   : 570  
##  3rd Qu.:26.3   3rd Qu.:57.0   3rd Qu.:205.8   3rd Qu.: 739  
##  Max.   :33.1   Max.   :73.5   Max.   :313.8   Max.   :1170  
##        Mg              S              Al            Fe       
##  Min.   : 25.7   Min.   :14.9   Min.   : 12   Min.   :  2.3  
##  1st Qu.: 60.9   1st Qu.:29.4   1st Qu.: 38   1st Qu.:  5.6  
##  Median : 75.0   Median :36.2   Median :107   Median : 27.8  
##  Mean   : 87.5   Mean   :37.2   Mean   :142   Mean   : 49.6  
##  3rd Qu.:108.7   3rd Qu.:43.6   3rd Qu.:234   3rd Qu.: 85.2  
##  Max.   :209.1   Max.   :60.2   Max.   :435   Max.   :204.4  
##        Mn              Zn              Mo           Baresoil   
##  Min.   : 10.1   Min.   : 2.60   Min.   :0.050   Min.   : 0.0  
##  1st Qu.: 26.7   1st Qu.: 5.38   1st Qu.:0.275   1st Qu.: 8.0  
##  Median : 36.5   Median : 8.10   Median :0.300   Median :21.2  
##  Mean   : 49.3   Mean   : 7.60   Mean   :0.396   Mean   :22.9  
##  3rd Qu.: 59.0   3rd Qu.: 9.10   3rd Qu.:0.500   3rd Qu.:30.8  
##  Max.   :132.0   Max.   :16.80   Max.   :1.100   Max.   :56.9  
##     Humdepth          pH      
##  Min.   :1.00   Min.   :2.70  
##  1st Qu.:1.80   1st Qu.:2.80  
##  Median :2.20   Median :2.90  
##  Mean   :2.20   Mean   :2.93  
##  3rd Qu.:2.62   3rd Qu.:3.00  
##  Max.   :3.80   Max.   :3.60

There is a difference between a data frame and a matrix in R: a data frame is actually a list of variables which can be of different types (continuous, factors, ordered factors), whereas a matrix only contains numbers of the same type. The dependencies among variables can be inspected visually using the plot command for data frames; under the hood this calls the pairs function

plot(varechem, gap=0, panel=panel.smooth)
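The difference between a data frame and a matrix mentioned above is easy to see with a small, made-up example: coercing a data frame with mixed column types to a matrix forces every column to a common type (here, character):

```r
# A data frame can mix numeric and character columns
df <- data.frame(height=c(1.2, 0.9), species=c("pine", "birch"))
m <- as.matrix(df)     # all columns coerced to a common (character) type
class(df$height)       # "numeric"
class(m[, "height"])   # "character"
```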

It is always useful to have a look at the data before rushing into analysis. It is necessary to have a look at the data if you read in your own data: you really must check that the data were imported correctly.

You can refer to rows and columns in the data frame in various ways. One way is to use numeric indices within square brackets [ ]. The first item refers to a row, the second item to a column

varechem[3,7]
## [1] 138

Several items can be referred to by combining indices within c() or by using a regular sequence with : between the first and last item:

varechem[2, c(3,1, 7)]
varechem[3:5, 7]
## [1] 138.0  15.4  24.2

If you omit one index, whole row or whole column will be given to you:

varechem[2,]
varechem[,7]
##  [1]  39.0  88.1 138.0  15.4  24.2 104.8  20.7  74.2  17.9 329.7  92.3
## [12] 124.3  12.1 294.9  39.0 155.1 304.6 435.1 316.5 227.1 108.8 168.2
## [23] 253.6  35.8

Finally, you can also refer to rows and columns by their names instead of numeric indices, and in data frames you can also use the dollar sign for variables:

varechem[, "pH"]
##  [1] 2.7 2.8 3.0 2.8 2.7 2.7 2.8 2.8 2.8 2.8 2.7 2.9 2.9 3.1 3.1 3.0 3.3
## [18] 2.9 2.9 3.0 2.9 3.2 3.6 3.0
varechem$pH
##  [1] 2.7 2.8 3.0 2.8 2.7 2.7 2.8 2.8 2.8 2.8 2.7 2.9 2.9 3.1 3.1 3.0 3.3
## [18] 2.9 2.9 3.0 2.9 3.2 3.6 3.0

6.2 Unconstrained Ordination

6.2.1 Nonmetric Multidimensional Scaling (NMDS)

The function metaMDS is in the vegan package. Internally it calls another function that runs the actual NMDS; the default NMDS engine is the function monoMDS, also in vegan.

Running NMDS takes only one command:

m <- metaMDS(dune)
## Run 0 stress 0.119 
## Run 1 stress 0.119 
## ... Procrustes: rmse 0.000317  max resid 0.000976 
## ... Similar to previous best
## Run 2 stress 0.119 
## ... Procrustes: rmse 9.7e-05  max resid 0.000297 
## ... Similar to previous best
## Run 3 stress 0.118 
## ... New best solution
## ... Procrustes: rmse 0.0203  max resid 0.0649 
## Run 4 stress 0.118 
## ... Procrustes: rmse 1.5e-05  max resid 4.77e-05 
## ... Similar to previous best
## Run 5 stress 0.181 
## Run 6 stress 0.118 
## ... New best solution
## ... Procrustes: rmse 7.2e-06  max resid 1.76e-05 
## ... Similar to previous best
## Run 7 stress 0.181 
## Run 8 stress 0.118 
## ... New best solution
## ... Procrustes: rmse 5.09e-06  max resid 1.62e-05 
## ... Similar to previous best
## Run 9 stress 0.118 
## ... Procrustes: rmse 2.19e-05  max resid 6.7e-05 
## ... Similar to previous best
## Run 10 stress 0.204 
## Run 11 stress 0.208 
## Run 12 stress 0.119 
## Run 13 stress 0.118 
## ... Procrustes: rmse 1.04e-05  max resid 3.52e-05 
## ... Similar to previous best
## Run 14 stress 0.119 
## Run 15 stress 0.118 
## ... Procrustes: rmse 3.26e-06  max resid 8.31e-06 
## ... Similar to previous best
## Run 16 stress 0.119 
## Run 17 stress 0.181 
## Run 18 stress 0.208 
## Run 19 stress 0.181 
## Run 20 stress 0.119 
## *** Solution reached

Sometimes metaMDS does not find the same best solution twice, and in that case you cannot be sure that the best solution found really is the global optimum. If you are uncertain, you can continue the iterations from your current solution by giving the name of your solution in the argument previous.best

m <- metaMDS(dune, previous.best=m)
## Starting from 2-dimensional configuration
## Run 0 stress 0.118 
## Run 1 stress 0.118 
## ... Procrustes: rmse 1.21e-05  max resid 3.54e-05 
## ... Similar to previous best
## Run 2 stress 0.118 
## ... Procrustes: rmse 1.52e-05  max resid 4.02e-05 
## ... Similar to previous best
## Run 3 stress 0.181 
## Run 4 stress 0.118 
## ... Procrustes: rmse 1.69e-05  max resid 4.28e-05 
## ... Similar to previous best
## Run 5 stress 0.119 
## Run 6 stress 0.118 
## ... Procrustes: rmse 2.59e-05  max resid 8.14e-05 
## ... Similar to previous best
## Run 7 stress 0.119 
## Run 8 stress 0.119 
## Run 9 stress 0.119 
## Run 10 stress 0.119 
## Run 11 stress 0.119 
## Run 12 stress 0.118 
## ... Procrustes: rmse 4.74e-05  max resid 0.000155 
## ... Similar to previous best
## Run 13 stress 0.119 
## Run 14 stress 0.119 
## Run 15 stress 0.119 
## Run 16 stress 0.118 
## ... Procrustes: rmse 8.48e-06  max resid 2.51e-05 
## ... Similar to previous best
## Run 17 stress 0.119 
## Run 18 stress 0.118 
## ... Procrustes: rmse 1.03e-05  max resid 2.57e-05 
## ... Similar to previous best
## Run 19 stress 0.118 
## ... Procrustes: rmse 2.42e-05  max resid 7.7e-05 
## ... Similar to previous best
## Run 20 stress 0.118 
## ... Procrustes: rmse 1.07e-05  max resid 2.85e-05 
## ... Similar to previous best
## *** Solution reached

We saved the results of the analysis in an object called m. You can use any decent name. Decent names do not have spaces, they do not begin with a number, and they do not contain special characters that can be interpreted as mathematical operators, such as +-/:*$&!. If you do not save the results, they will be printed on the screen and will be lost once the screen scrolls out of sight. The symbol <- saves the results, but you can also use the more familiar =. If you type the name of your result object, the object will be printed on the screen

m
## 
## Call:
## metaMDS(comm = dune, previous.best = m) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     dune 
## Distance: bray 
## 
## Dimensions: 2 
## Stress:     0.118 
## Stress type 1, weak ties
## Two convergent solutions found after 40 tries
## Scaling: centring, PC rotation, halfchange scaling 
## Species: expanded scores based on 'dune'

However, this is only a printed presentation of the object, and the result object contains much more detailed information. For instance, the ordination scores are not displayed, although they also are saved. You can see this by plotting the object

plot(m)

The default plot command uses black circles for sample plots and red crosses for species. You can see the names of both by giving the argument type=“t” for text

plot(m, type="t")

The metaMDS function decided many things for you. For instance, you did not need to define or calculate the dissimilarity index, but the function automatically used Bray-Curtis which is a popular choice. However, you can use any of the indices defined in function vegdist, or you can even define the name of some other function to calculate the dissimilarities. For instance, you can use the Euclidean distances which are commonly regarded as a poor choice for community data

m2 <- metaMDS(dune, dist="euclid")
## Run 0 stress 0.117 
## Run 1 stress 0.117 
## ... Procrustes: rmse 1.51e-06  max resid 3.31e-06 
## ... Similar to previous best
## Run 2 stress 0.117 
## ... New best solution
## ... Procrustes: rmse 1.24e-06  max resid 2.68e-06 
## ... Similar to previous best
## Run 3 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0553 
## Run 4 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0553 
## Run 5 stress 0.117 
## ... Procrustes: rmse 4.92e-07  max resid 9.12e-07 
## ... Similar to previous best
## Run 6 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0553 
## Run 7 stress 0.117 
## ... Procrustes: rmse 3.43e-06  max resid 7.18e-06 
## ... Similar to previous best
## Run 8 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0553 
## Run 9 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0553 
## Run 10 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0553 
## Run 11 stress 0.117 
## ... Procrustes: rmse 1.6e-06  max resid 4.36e-06 
## ... Similar to previous best
## Run 12 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0552 
## Run 13 stress 0.117 
## ... Procrustes: rmse 8.73e-07  max resid 2.41e-06 
## ... Similar to previous best
## Run 14 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0553 
## Run 15 stress 0.117 
## ... Procrustes: rmse 1.03e-06  max resid 2.78e-06 
## ... Similar to previous best
## Run 16 stress 0.117 
## ... Procrustes: rmse 4.1e-06  max resid 1.25e-05 
## ... Similar to previous best
## Run 17 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0553 
## Run 18 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0553 
## Run 19 stress 0.118 
## ... Procrustes: rmse 0.0171  max resid 0.0553 
## Run 20 stress 0.117 
## ... Procrustes: rmse 8.96e-07  max resid 2.38e-06 
## ... Similar to previous best
## *** Solution reached
m2
## 
## Call:
## metaMDS(comm = dune, distance = "euclid") 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     dune 
## Distance: euclidean 
## 
## Dimensions: 2 
## Stress:     0.117 
## Stress type 1, weak ties
## Two convergent solutions found after 20 tries
## Scaling: centring, PC rotation 
## Species: expanded scores based on 'dune'

You can compare the shape of the nonlinear regression with these two methods using the function stressplot

stressplot(m)

stressplot(m2)

How do the stress plots differ from each other? You can directly compare the similarity of the results using Procrustes rotation (do so, and describe the differences)

plot(procrustes(m, m2))

6.2.2 Eigenvector Ordination: Principal Components Analysis

PCA can be run with command rda, or you can use standard R commands prcomp or princomp. We shall later use rda for constrained ordination, and therefore we use it also for PCA

ord <- rda(dune)
ord
## Call: rda(X = dune)
## 
##               Inertia Rank
## Total            84.1     
## Unconstrained    84.1   19
## Inertia is variance 
## 
## Eigenvalues for unconstrained axes:
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 24.80 18.15  7.63  7.15  5.70  4.33  3.20  2.78 
## (Showed only 8 of all 19 unconstrained eigenvalues)
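As a quick check that the standard commands agree, here is a base-R-only sketch (using mtcars rather than dune, since rda requires vegan): the variances of the principal components returned by prcomp are exactly the eigenvalues of the covariance matrix.

```r
# PCA via prcomp vs. eigen-decomposition of the covariance matrix
X <- as.matrix(mtcars[, c("mpg", "disp", "drat", "wt")])
p <- prcomp(X)                    # PCA by singular value decomposition
ev <- eigen(cov(X))$values        # eigenvalues of the covariance matrix
all.equal(unname(p$sdev^2), ev)   # TRUE: PC variances are these eigenvalues
```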

The results can be plotted with many alternative scaling systems for sites and species. The scalings are described in help(scores.cca), and discussed in depth in the vignette on Design Decisions, which can be accessed using the vegan command vegandocs(). You can inspect the visual effects by scrolling through the alternatives

plot(ord)

plot(ord, scal=1)

plot(ord, scal=3)

Study these alternative plot scalings: see ?scores.cca for explanation.

?scores.cca

In addition to the standard plot function, you can also use the biplot function, which uses arrows for species instead of points.

biplot(ord, scal=2)

To give equal weights to all species, you should specify a new argument in the call

sord <- rda(dune, scal=TRUE)
sord
## Call: rda(X = dune, scale = TRUE)
## 
##               Inertia Rank
## Total              30     
## Unconstrained      30   19
## Inertia is correlations 
## 
## Eigenvalues for unconstrained axes:
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 7.03 5.00 3.55 2.64 2.14 1.76 1.48 1.32 
## (Showed only 8 of all 19 unconstrained eigenvalues)
plot(ord)

plot(sord)

biplot(sord)

How do the results differ from unscaled PCA?

Eigenvector ordinations implement a linear mapping of Euclidean distances onto the ordination (check lectures). Nonmetric Multidimensional Scaling (NMDS) uses a nonlinear mapping of any distance or dissimilarity measure. You can study the effect of both nonlinear mapping and non-Euclidean distances using Procrustes rotation

plot(procrustes(m, ord))

plot(procrustes(m2, ord))

6.2.3 Eigenvector Ordination: Correspondence Analysis

CA analysis is similar to PCA, but uses command cca instead of rda

mca = cca(dune)
mca
## Call: cca(X = dune)
## 
##               Inertia Rank
## Total            2.12     
## Unconstrained    2.12   19
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for unconstrained axes:
##   CA1   CA2   CA3   CA4   CA5   CA6   CA7   CA8 
## 0.536 0.400 0.260 0.176 0.145 0.108 0.092 0.081 
## (Showed only 8 of all 19 unconstrained eigenvalues)

Compare the output of cca to that of rda and find out how they differ (lecture slides may be useful). The plotting works similarly to PCA, and again there are several scaling alternatives; their descriptions can be found with help(scores.cca), and inspected visually

plot(mca)

plot(mca, scal=1)

plot(mca, scal=2)

plot(mca, scal=3)

6.2.4 Anatomy of an R plot

6.2.4.1 Basic

We have already used the standard plot function for ordination results. This is a quick and dirty solution to give a first impression of the results. For clean or publishable results we may need to exercise closer control over the output, and therefore we need to understand the inner workings of R plots. The basic plot command does not only draw a graph: most importantly, it sets the plotting area and plotting scales. After plot you can add several new elements to the created plot. For instance, you can add text in the margins, and for this purpose the plot command reserves empty space in the margins

plot(mca)
title(main = "Correspondence Analysis")

There is empty space in the margin in case you want to add some comments there. Often you do not want to have anything there, and you can set narrower margins to get a larger plotting area. There are many other graphical parameters that can be set. Many of these are described in ?par. The margins are set by parameter called mar which sets the margins as four numbers of margin widths in lines (rows of text) in order bottom, left, top, right. The following sets a bit narrower margins than the default

par(mar=c(4,4,1,1)+.1)
plot(mca)

You have only limited control over the basic plot. For instance, you can select either type=“t” (text) or type=“p” (points), but you cannot use points for sample plots and text for species, nor can you select the colours or sizes of symbols separately. However, you can first draw an empty plot (type=“n”, none), and then use the commands points or text to add items to the existing plot

plot(mca, type="n")
points(mca, display = "sites", col="blue", pch=16)
text(mca, col="red", dis="sp")

Both functions are configurable as you see in ?text and ?points. Plotting characters (pch) of points can be given either as numbers (described in the help page), or as symbols (such as pch = “+”). For a quick survey of choices you can use commands example(text) and example(points)
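A compact way to survey the numeric plotting characters, as a sketch (example(points) gives the full tour):

```r
# Draw symbols 0 through 25 in a row, labelled by their pch number
plot(0:25, rep(1, 26), pch=0:25, ylim=c(0, 2),
     yaxt="n", ylab="", xlab="pch number")
text(0:25, rep(1.3, 26), labels=0:25, cex=.7)
```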

6.2.4.2 Congested plots

Ordination plots are often crowded and messy: names are written over each other and may be difficult to read. For publications you need cleaner plots. One alternative is to use opaque labels for text with function ordilabel. These will still cover each other, but at least the uppermost will be readable. With argument priority you can select which labels are uppermost. The following draws an empty plot, adds sample plots as points, and then species names on opaque labels giving higher priority to more abundant species (high column sums)

plot(mca, type = "n")
points(mca, display="sites")
abu <- colSums(dune)
ordilabel(mca, col="red", dis="sp", fill="peachpuff", priority=abu)

Another alternative is to use function orditorp which uses text only if this does not cover previously added names, and points otherwise. The function also knows the priority argument

plot(mca, type = "n")
points(mca, display="sites", pch=23, col="red", bg="yellow")
orditorp(mca, dis = "sp", prio=abu, pch="+", pcol="gray")

Finally there is the function ordipointlabel, which uses both points and text to label the points. The function tries to locate the text to minimize the overlap. This is a slow numerical process that will usually reach a different result each time, but it can sometimes give decent results automatically (there are data sets with so many species or observations that it is impossible to label all of them neatly).

ordipointlabel(mca)

6.2.4.3 Alternatives

As a second example we examine a larger data set in which a number of Dutch ditches were treated with the insecticide Pyrifos, and the impact on and recovery of the animal plankton were followed. The sampling design is regular, the environmental data are automatically generated with the command example, and the data set is described more thoroughly in help(pyrifos)

data(pyrifos)
example(pyrifos)
## 
## pyrifs> data(pyrifos)
## 
## pyrifs> ditch <- gl(12, 1, length=132)
## 
## pyrifs> week <- gl(11, 12, labels=c(-4, -1, 0.1, 1, 2, 4, 8, 12, 15, 19, 24))
## 
## pyrifs> dose <- factor(rep(c(0.1, 0, 0, 0.9, 0, 44, 6, 0.1, 44, 0.9, 0, 6), 11))

We use Detrended Correspondence Analysis (DCA) to demonstrate that method as well

ditch
##   [1] 1  2  3  4  5  6  7  8  9  10 11 12 1  2  3  4  5  6  7  8  9  10 11
##  [24] 12 1  2  3  4  5  6  7  8  9  10 11 12 1  2  3  4  5  6  7  8  9  10
##  [47] 11 12 1  2  3  4  5  6  7  8  9  10 11 12 1  2  3  4  5  6  7  8  9 
##  [70] 10 11 12 1  2  3  4  5  6  7  8  9  10 11 12 1  2  3  4  5  6  7  8 
##  [93] 9  10 11 12 1  2  3  4  5  6  7  8  9  10 11 12 1  2  3  4  5  6  7 
## [116] 8  9  10 11 12 1  2  3  4  5  6  7  8  9  10 11 12
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12
dose
##   [1] 0.1 0   0   0.9 0   44  6   0.1 44  0.9 0   6   0.1 0   0   0.9 0  
##  [18] 44  6   0.1 44  0.9 0   6   0.1 0   0   0.9 0   44  6   0.1 44  0.9
##  [35] 0   6   0.1 0   0   0.9 0   44  6   0.1 44  0.9 0   6   0.1 0   0  
##  [52] 0.9 0   44  6   0.1 44  0.9 0   6   0.1 0   0   0.9 0   44  6   0.1
##  [69] 44  0.9 0   6   0.1 0   0   0.9 0   44  6   0.1 44  0.9 0   6   0.1
##  [86] 0   0   0.9 0   44  6   0.1 44  0.9 0   6   0.1 0   0   0.9 0   44 
## [103] 6   0.1 44  0.9 0   6   0.1 0   0   0.9 0   44  6   0.1 44  0.9 0  
## [120] 6   0.1 0   0   0.9 0   44  6   0.1 44  0.9 0   6  
## Levels: 0 0.1 0.9 6 44
dca <- decorana(pyrifos)
dca
## 
## Call:
## decorana(veg = pyrifos) 
## 
## Detrended correspondence analysis with 26 segments.
## Rescaling of axes with 4 iterations.
## 
##                  DCA1   DCA2   DCA3   DCA4
## Eigenvalues     0.135 0.0884 0.0777 0.0629
## Decorana values 0.140 0.0933 0.0754 0.0651
## Axis lengths    1.516 1.7717 1.3735 1.5401

Compare the display of this analysis with other methods. The following gives ordinary CA for comparison:

cap <- cca(pyrifos)
cap
## Call: cca(X = pyrifos)
## 
##               Inertia Rank
## Total            2.87     
## Unconstrained    2.87  131
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8 
## 0.1399 0.0981 0.0903 0.0834 0.0772 0.0733 0.0695 0.0655 
## (Showed only 8 of all 131 unconstrained eigenvalues)
plot(procrustes(cap, dca))
## Warning in procrustes(cap, dca): X has fewer axes than Y: X adjusted to comform Y

plot(procrustes(dca, cap, choices=1:2))

Spinning is good for private use since it helps in understanding 3D structures, but it is difficult to use in papers. An alternative is to use Lattice (Trellis) graphics to produce panels of plots or other special displays. The following produces a separate panel for each Pyrifos dose, displays each ditch in a different colour, and connects the points within a ditch by lines showing the temporal succession.

ordixyplot(dca, form = DCA2 ~ DCA1 | dose, group=ditch, type="b")

The formula says: plot axis DCA2 against DCA1, making a separate panel for each dose, and use lines and points (type = "b", for "both" lines and points) for each ditch. Please note that the sign before dose is not a slash but a vertical bar (|). Using this graph, explain how the natural annual succession and the impact of Pyrifos can be seen.

6.2.4.4 Fitting environmental variables

The basic command for fitting vectors or factor levels of environmental variables is envfit. The following example uses the two alternative NMDS results from above. Which environmental variables are most important? Which of these alternative ordinations seems to be better related to the environment?

data(dune.env)
envfit(m, dune.env)
## 
## ***VECTORS
## 
##    NMDS1 NMDS2   r2 Pr(>r)  
## A1 0.965 0.263 0.36  0.019 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
## 
## ***FACTORS:
## 
## Centroids:
##              NMDS1 NMDS2
## Moisture1    -0.51 -0.04
## Moisture2    -0.39  0.01
## Moisture4     0.28 -0.40
## Moisture5     0.66  0.15
## ManagementBF -0.45 -0.01
## ManagementHF -0.26 -0.13
## ManagementNM  0.30  0.58
## ManagementSF  0.15 -0.47
## UseHayfield  -0.16  0.32
## UseHaypastu  -0.04 -0.34
## UsePasture    0.29  0.08
## Manure0       0.30  0.58
## Manure1      -0.25 -0.02
## Manure2      -0.31 -0.19
## Manure3       0.31 -0.25
## Manure4      -0.35 -0.56
## 
## Goodness of fit:
##              r2 Pr(>r)    
## Moisture   0.50  0.001 ***
## Management 0.41  0.009 ** 
## Use        0.19  0.122    
## Manure     0.42  0.019 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
ef <- envfit(m, dune.env, perm=1000)
ef
## 
## ***VECTORS
## 
##    NMDS1 NMDS2   r2 Pr(>r)  
## A1 0.965 0.263 0.36  0.016 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 1000
## 
## ***FACTORS:
## 
## Centroids:
##              NMDS1 NMDS2
## Moisture1    -0.51 -0.04
## Moisture2    -0.39  0.01
## Moisture4     0.28 -0.40
## Moisture5     0.66  0.15
## ManagementBF -0.45 -0.01
## ManagementHF -0.26 -0.13
## ManagementNM  0.30  0.58
## ManagementSF  0.15 -0.47
## UseHayfield  -0.16  0.32
## UseHaypastu  -0.04 -0.34
## UsePasture    0.29  0.08
## Manure0       0.30  0.58
## Manure1      -0.25 -0.02
## Manure2      -0.31 -0.19
## Manure3       0.31 -0.25
## Manure4      -0.35 -0.56
## 
## Goodness of fit:
##              r2 Pr(>r)   
## Moisture   0.50  0.002 **
## Management 0.41  0.004 **
## Use        0.19  0.138   
## Manure     0.42  0.019 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 1000
ef2 <- envfit(m2, dune.env, perm=1000)
ef2
## 
## ***VECTORS
## 
##    NMDS1 NMDS2  r2 Pr(>r)
## A1 0.975 0.221 0.2   0.13
## Permutation: free
## Number of permutations: 1000
## 
## ***FACTORS:
## 
## Centroids:
##              NMDS1 NMDS2
## Moisture1    -4.52  0.75
## Moisture2    -1.82 -1.72
## Moisture4     2.88 -4.24
## Moisture5     4.73  1.44
## ManagementBF -4.83  0.05
## ManagementHF -3.25 -0.52
## ManagementNM  2.85  4.34
## ManagementSF  2.27 -3.94
## UseHayfield  -0.65  2.41
## UseHaypastu  -0.60 -2.77
## UsePasture    1.88  1.07
## Manure0       2.85  4.34
## Manure1      -2.96  0.02
## Manure2      -3.66 -1.30
## Manure3       2.66 -1.87
## Manure4      -1.41 -4.48
## 
## Goodness of fit:
##              r2 Pr(>r)    
## Moisture   0.51  0.001 ***
## Management 0.52  0.001 ***
## Use        0.17  0.173    
## Manure     0.46  0.018 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 1000

We can add the fitted variables to the plot as vectors and centroids of factor levels. The vectors are automatically scaled to the graph area. The following shows only the variables that were regarded as statistically significant at level P ≤ 0.05:

plot(m)
plot(ef, add = TRUE, p.max = 0.05)

plot(m2)
plot(ef2, add = TRUE, p.max = 0.05)

6.3 Constrained Ordination

6.3.1 Simple Cases

If you only have continuous variables, you can constrain the ordination by adding a second argument to the call of cca or rda:

m <- cca(varespec)
mm <- cca(varespec, varechem)
m
## Call: cca(X = varespec)
## 
##               Inertia Rank
## Total            2.08     
## Unconstrained    2.08   23
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for unconstrained axes:
##   CA1   CA2   CA3   CA4   CA5   CA6   CA7   CA8 
## 0.525 0.357 0.234 0.195 0.178 0.122 0.115 0.089 
## (Showed only 8 of all 23 unconstrained eigenvalues)
mm
## Call: cca(X = varespec, Y = varechem)
## 
##               Inertia Proportion Rank
## Total           2.083      1.000     
## Constrained     1.441      0.692   14
## Unconstrained   0.642      0.308    9
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for constrained axes:
##  CCA1  CCA2  CCA3  CCA4  CCA5  CCA6  CCA7  CCA8  CCA9 CCA10 CCA11 CCA12 
## 0.439 0.292 0.163 0.142 0.118 0.089 0.070 0.058 0.031 0.013 0.008 0.007 
## CCA13 CCA14 
## 0.006 0.005 
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8    CA9 
## 0.1978 0.1419 0.1012 0.0708 0.0533 0.0333 0.0189 0.0151 0.0095

Compare the output of these two ordinations. In particular, see how the inertia and rank (number of axes) are decomposed in constrained ordination. When calling the function this way, the only way to select environmental variables from the full data set is to use a subset of the matrix:

cca(varespec, varechem[, c("Al", "P", "K")])
## Call: cca(X = varespec, Y = varechem[, c("Al", "P", "K")])
## 
##               Inertia Proportion Rank
## Total           2.083      1.000     
## Constrained     0.644      0.309    3
## Unconstrained   1.439      0.691   20
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for constrained axes:
##  CCA1  CCA2  CCA3 
## 0.362 0.170 0.113 
## 
## Eigenvalues for unconstrained axes:
##   CA1   CA2   CA3   CA4   CA5   CA6   CA7   CA8 
## 0.350 0.220 0.185 0.155 0.135 0.100 0.077 0.054 
## (Showed only 8 of all 20 unconstrained eigenvalues)

For a full numerical survey, you can use summary. The summary lists scores for every species and every site, and for sites, two different kinds of scores. The output is long, but if you only want to get an idea of the style of the results, you can limit the output to the head (or tail) of each type of scores:

head(summary(mm))
## 
## Call:
## cca(X = varespec, Y = varechem) 
## 
## Partitioning of mean squared contingency coefficient:
##               Inertia Proportion
## Total           2.083      1.000
## Constrained     1.441      0.692
## Unconstrained   0.642      0.308
## 
## Eigenvalues, and their contribution to the mean squared contingency coefficient 
## 
## Importance of components:
##                        CCA1  CCA2   CCA3   CCA4   CCA5   CCA6   CCA7
## Eigenvalue            0.439 0.292 0.1628 0.1421 0.1180 0.0890 0.0703
## Proportion Explained  0.211 0.140 0.0782 0.0682 0.0566 0.0427 0.0337
## Cumulative Proportion 0.211 0.351 0.4289 0.4971 0.5537 0.5965 0.6302
##                         CCA8   CCA9   CCA10   CCA11   CCA12   CCA13
## Eigenvalue            0.0584 0.0311 0.01329 0.00836 0.00654 0.00616
## Proportion Explained  0.0280 0.0149 0.00638 0.00402 0.00314 0.00296
## Cumulative Proportion 0.6583 0.6732 0.67958 0.68359 0.68673 0.68969
##                         CCA14    CA1    CA2    CA3    CA4    CA5    CA6
## Eigenvalue            0.00473 0.1978 0.1419 0.1012 0.0708 0.0533 0.0333
## Proportion Explained  0.00227 0.0949 0.0681 0.0486 0.0340 0.0256 0.0160
## Cumulative Proportion 0.69196 0.7869 0.8550 0.9036 0.9376 0.9631 0.9791
##                           CA7     CA8     CA9
## Eigenvalue            0.01887 0.01510 0.00949
## Proportion Explained  0.00906 0.00725 0.00455
## Cumulative Proportion 0.98820 0.99545 1.00000
## 
## Accumulated constrained eigenvalues
## Importance of components:
##                        CCA1  CCA2  CCA3   CCA4   CCA5   CCA6   CCA7   CCA8
## Eigenvalue            0.439 0.292 0.163 0.1421 0.1180 0.0890 0.0703 0.0584
## Proportion Explained  0.304 0.202 0.113 0.0986 0.0818 0.0618 0.0488 0.0405
## Cumulative Proportion 0.304 0.507 0.620 0.7184 0.8003 0.8620 0.9108 0.9513
##                         CCA9   CCA10   CCA11   CCA12   CCA13   CCA14
## Eigenvalue            0.0311 0.01329 0.00836 0.00654 0.00616 0.00473
## Proportion Explained  0.0216 0.00922 0.00580 0.00454 0.00427 0.00328
## Cumulative Proportion 0.9729 0.98211 0.98791 0.99245 0.99672 1.00000
## 
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## 
## 
## Species scores
## 
##             CCA1    CCA2    CCA3    CCA4    CCA5     CCA6
## Callvulg  0.0753 -0.9358  1.6777  0.6955  1.0775 -0.34500
## Empenigr -0.1813  0.0761  0.0365 -0.4277 -0.1382  0.01052
## Rhodtome -1.0535 -0.0603  0.0774 -0.9389 -0.2139 -0.51803
## Vaccmyrt -1.2774  0.3076  0.3037 -0.0921 -0.5688 -0.61302
## Vaccviti -0.1526  0.1205 -0.0530 -0.3623  0.0839  0.00894
## Pinusylv  0.2430  0.2643  0.2233 -0.2738  0.2921 -0.06334
## ....                                                     
## 
## 
## Site scores (weighted averages of species scores)
## 
##        CCA1   CCA2    CCA3    CCA4   CCA5    CCA6
## 18    0.178 -1.060 -0.4088 -0.6072 -0.565  0.2417
## 15   -0.970 -0.197  0.4210  0.3032  0.152  0.8039
## 24   -1.280  0.476 -2.9469  0.3929  3.954  0.7659
## 27   -1.501  0.652  0.0858  0.7621 -1.233 -0.0976
## 23   -0.598 -0.184 -0.1356 -1.1642 -0.302  0.0703
## 19   -0.110  0.714  0.0166 -0.0777 -0.552 -0.0826
## ....                                             
## 
## 
## Site constraints (linear combinations of constraining variables)
## 
##        CCA1   CCA2    CCA3   CCA4    CCA5   CCA6
## 18   -0.423 -1.325 -0.4921 -0.945 -0.0485  0.940
## 15   -0.190  0.497  0.4545 -0.530 -0.0766 -0.790
## 24   -0.863  0.252 -2.7604  0.570  3.2927  0.263
## 27   -1.698  0.487 -0.5635  1.074 -0.6141  0.499
## 23   -0.796  0.107  0.2575 -0.904 -0.2876  0.439
## 19   -0.677  1.001  0.0334 -1.004 -0.1413 -0.938
## ....                                            
## 
## 
## Biplot scores for constraining variables
## 
##             CCA1    CCA2     CCA3    CCA4     CCA5     CCA6
## N        -0.2229 -0.5289  0.00673  0.1773 -0.25322  0.10201
## P        -0.3187  0.5789 -0.16200  0.4795  0.18410 -0.12183
## K        -0.3661  0.3079  0.35982  0.4794  0.32544 -0.19664
## Ca       -0.4476  0.4218 -0.03776  0.0983  0.30797  0.04355
## Mg       -0.4350  0.3405 -0.14217  0.1079  0.49784 -0.00576
## S        -0.0241  0.4157  0.14838  0.4445  0.59706 -0.16630
## Al        0.7698 -0.0475  0.03761  0.3910  0.16090 -0.33655
## Fe        0.6491 -0.0881 -0.04207  0.2630 -0.06981 -0.11134
## Mn       -0.7223  0.2246  0.11305  0.2915 -0.13868  0.18047
## Zn       -0.3581  0.3349 -0.27792  0.3457  0.61919 -0.00119
## Mo        0.2041 -0.1033 -0.15701  0.3242  0.51644 -0.31352
## Baresoil -0.5367 -0.2548  0.13691 -0.5206  0.16662 -0.35241
## Humdepth -0.6967  0.2016  0.27163 -0.1357 -0.00325 -0.05135
## pH        0.4972  0.0751 -0.32634  0.0209 -0.14557 -0.05909

The summary also shows how the total inertia is divided between individual axes, and the lines for accounted inertia show the accumulated variation explained by all axes, as well as the accounted constrained inertia. The two kinds of site scores are often called WA scores and LC scores. The WA scores are derived from species scores, and the LC scores are derived from the constraints as their linear combinations. The analysis tries to keep these two sets as similar as possible. Their similarity can be inspected with the species-environment correlation, which simply is the correlation between the LC and WA scores for an axis.

spenvcor(mm)
##  CCA1  CCA2  CCA3  CCA4  CCA5  CCA6  CCA7  CCA8  CCA9 CCA10 CCA11 CCA12 
## 0.927 0.929 0.929 0.927 0.815 0.775 0.900 0.794 0.639 0.755 0.402 0.570 
## CCA13 CCA14 
## 0.592 0.588

The default plot displays species scores, WA scores and environmental variables.

plot(mm)

You can change this by giving the displayed elements in argument display. The following shows only LC scores and environmental variables:

plot(mm, display = c("lc","bp"))

The items are called "sp", "wa", "lc", "bp", "cn", which are self-explanatory, except the last, which refers to the centroids of the environmental variables. You can visually inspect the difference between WA and LC scores (or the species-environment relationship) with the command ordispider, which (among other alternatives) joins WA and LC scores by a line:

plot(mm, dis=c("wa","lc"))
ordispider(mm)

It is often a bad idea to use constrained ordination with a very large number of constraints: probably you do not really constrain the solution at all, and your analysis may be very similar to the ordinary unconstrained analysis that could have been used just as well. You can inspect this with Procrustes rotation:

plot(procrustes(m, mm))

You can also see how similar the ordinations are by fitting environmental variables to unconstrained ordination.

plot(m)
plot(envfit(m, varechem))

6.3.2 Formula Interface

You can use the formula interface to select the variables used as constraints:

cca(varespec ~  Al + P + K, data=varechem)
## Call: cca(formula = varespec ~ Al + P + K, data = varechem)
## 
##               Inertia Proportion Rank
## Total           2.083      1.000     
## Constrained     0.644      0.309    3
## Unconstrained   1.439      0.691   20
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for constrained axes:
##  CCA1  CCA2  CCA3 
## 0.362 0.170 0.113 
## 
## Eigenvalues for unconstrained axes:
##   CA1   CA2   CA3   CA4   CA5   CA6   CA7   CA8 
## 0.350 0.220 0.185 0.155 0.135 0.100 0.077 0.054 
## (Showed only 8 of all 20 unconstrained eigenvalues)

This indeed is the recommended method of defining the model, because it allows full control of the model, and some further analyses are possible only when the model is defined through a formula. With the formula interface you can also use factor constraints:

data(dune)
data(dune.env)
str(dune.env)
## 'data.frame':    20 obs. of  5 variables:
##  $ A1        : num  2.8 3.5 4.3 4.2 6.3 4.3 2.8 4.2 3.7 3.3 ...
##  $ Moisture  : Ord.factor w/ 4 levels "1"<"2"<"4"<"5": 1 1 2 2 1 1 1 4 3 2 ...
##  $ Management: Factor w/ 4 levels "BF","HF","NM",..: 4 1 4 4 2 2 2 2 2 1 ...
##  $ Use       : Ord.factor w/ 3 levels "Hayfield"<"Haypastu"<..: 2 2 2 2 1 2 3 3 1 1 ...
##  $ Manure    : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 3 5 5 3 3 4 4 2 2 ...
summary(dune.env)
##        A1        Moisture Management       Use    Manure
##  Min.   : 2.80   1:7      BF:3       Hayfield:7   0:6   
##  1st Qu.: 3.50   2:4      HF:5       Haypastu:8   1:3   
##  Median : 4.20   4:2      NM:6       Pasture :5   2:4   
##  Mean   : 4.85   5:7      SF:6                    3:4   
##  3rd Qu.: 5.72                                    4:3   
##  Max.   :11.50
mdun <- cca(dune ~  Management + A1, dune.env)
mdun
## Call: cca(formula = dune ~ Management + A1, data = dune.env)
## 
##               Inertia Proportion Rank
## Total           2.115      1.000     
## Constrained     0.780      0.369    4
## Unconstrained   1.335      0.631   15
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for constrained axes:
##  CCA1  CCA2  CCA3  CCA4 
## 0.319 0.237 0.132 0.092 
## 
## Eigenvalues for unconstrained axes:
##   CA1   CA2   CA3   CA4   CA5   CA6   CA7   CA8   CA9  CA10  CA11  CA12 
## 0.362 0.203 0.153 0.135 0.111 0.080 0.077 0.055 0.044 0.042 0.032 0.018 
##  CA13  CA14  CA15 
## 0.012 0.009 0.005
plot(mdun)

Note here how the use of factors increases the rank of the constrained solution. If you have p levels of a factor, you will get p − 1 axes, provided your factor levels really are independent.
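You can verify the rank rule directly on the Management factor. A minimal sketch, assuming vegan and the dune data are attached as above; mfac is a name chosen here for illustration:

```r
library(vegan)
data(dune)
data(dune.env)
# Management has 4 levels ...
nlevels(dune.env$Management)
# ... so constraining by it alone should give 4 - 1 = 3 constrained axes
mfac <- cca(dune ~ Management, data = dune.env)
mfac$CCA$rank
```

The rank equals the number of dummy variables (levels minus one) that the factor contributes to the constraint matrix.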

6.3.2.1 Model Choice

With a formula we have full control of the model, but we face the problem of model choice. Models must be built carefully, and preferably used to test specific hypotheses. Sometimes we may want to use automatic model building, but this must be done carefully. There are some shortcuts and tricks for this in vegan, but they should be used with utmost care. In automatic model building we usually need two extreme models: the smallest and the largest model considered. The following shortcuts build a model with all environmental variables and a model with no environmental variables, both defined through a formula so that terms can be added to or removed from the model:

m1 <- cca(varespec ~ ., varechem)
m0 <- cca(varespec ~ 1, varechem)
m1
## Call: cca(formula = varespec ~ N + P + K + Ca + Mg + S + Al + Fe +
## Mn + Zn + Mo + Baresoil + Humdepth + pH, data = varechem)
## 
##               Inertia Proportion Rank
## Total           2.083      1.000     
## Constrained     1.441      0.692   14
## Unconstrained   0.642      0.308    9
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for constrained axes:
##  CCA1  CCA2  CCA3  CCA4  CCA5  CCA6  CCA7  CCA8  CCA9 CCA10 CCA11 CCA12 
## 0.439 0.292 0.163 0.142 0.118 0.089 0.070 0.058 0.031 0.013 0.008 0.007 
## CCA13 CCA14 
## 0.006 0.005 
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8    CA9 
## 0.1978 0.1419 0.1012 0.0708 0.0533 0.0333 0.0189 0.0151 0.0095
m0
## Call: cca(formula = varespec ~ 1, data = varechem)
## 
##               Inertia Rank
## Total            2.08     
## Unconstrained    2.08   23
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for unconstrained axes:
##   CA1   CA2   CA3   CA4   CA5   CA6   CA7   CA8 
## 0.525 0.357 0.234 0.195 0.178 0.122 0.115 0.089 
## (Showed only 8 of all 23 unconstrained eigenvalues)

Then we can use the step function to select the "best" model. The step function uses Akaike's Information Criterion (AIC) for model choice. The AIC is based on the goodness of fit (high constrained inertia), but it is penalized by the number of estimated parameters (constrained rank). The alternative models are ordered by AIC. In each case, "+" indicates the effect of adding a term and "-" the effect of removing a term, while the current model is marked as <none>. The model building proceeds by steps until the current model (<none>) is the best.

m <- step(m0, scope=formula(m1), test="perm")
## Start:  AIC=130
## varespec ~ 1
## 
##            Df AIC    F Pr(>F)   
## + Al        1 129 3.67  0.005 **
## + Mn        1 129 3.31  0.010 **
## + Humdepth  1 129 3.01  0.010 **
## + Baresoil  1 130 2.46  0.020 * 
## + Fe        1 130 2.44  0.025 * 
## + P         1 130 2.19  0.035 * 
## + Zn        1 130 1.93  0.065 . 
## <none>        130               
## + Mg        1 130 1.87  0.035 * 
## + K         1 130 1.86  0.085 . 
## + Ca        1 130 1.80  0.075 . 
## + pH        1 131 1.66  0.105   
## + S         1 131 1.51  0.100 . 
## + N         1 131 1.46  0.135   
## + Mo        1 131 1.06  0.425   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=129
## varespec ~ Al
## 
##            Df AIC    F Pr(>F)   
## + P         1 128 2.50  0.010 **
## + K         1 128 2.32  0.010 **
## + S         1 128 2.16  0.015 * 
## + Zn        1 128 1.99  0.020 * 
## + Mn        1 128 1.89  0.030 * 
## <none>        129               
## + Mg        1 129 1.74  0.055 . 
## + N         1 129 1.59  0.115   
## + Baresoil  1 129 1.57  0.095 . 
## + Ca        1 129 1.42  0.170   
## + Humdepth  1 129 1.38  0.210   
## + Mo        1 130 0.99  0.380   
## + pH        1 130 0.88  0.590   
## + Fe        1 130 0.52  0.880   
## - Al        1 130 3.67  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=128
## varespec ~ Al + P
## 
##            Df AIC    F Pr(>F)   
## + K         1 127 2.17  0.035 * 
## <none>        128               
## + Baresoil  1 128 1.66  0.040 * 
## + N         1 128 1.55  0.085 . 
## + S         1 128 1.34  0.230   
## + Mn        1 128 1.26  0.215   
## + Zn        1 128 1.20  0.255   
## + Humdepth  1 129 1.15  0.340   
## - P         1 129 2.50  0.015 * 
## + Mo        1 129 0.98  0.415   
## + Mg        1 129 0.96  0.450   
## + pH        1 129 0.92  0.540   
## + Fe        1 129 0.53  0.895   
## + Ca        1 129 0.46  0.900   
## - Al        1 130 3.94  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=127
## varespec ~ Al + P + K
## 
##            Df AIC    F Pr(>F)   
## <none>        127               
## + N         1 128 1.51  0.120   
## + Baresoil  1 128 1.45  0.140   
## + Zn        1 128 1.31  0.230   
## + S         1 128 1.26  0.305   
## - K         1 128 2.17  0.035 * 
## + Mo        1 128 1.24  0.310   
## - P         1 128 2.34  0.015 * 
## + Mg        1 128 1.03  0.415   
## + Mn        1 128 0.89  0.530   
## + Humdepth  1 128 0.81  0.565   
## + Fe        1 129 0.52  0.895   
## + pH        1 129 0.51  0.905   
## + Ca        1 129 0.44  0.930   
## - Al        1 130 4.33  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With continuous variables, model building often works rather well, but it should not be trusted blindly (if at all). You can see this if you use rda instead of cca, or the dune and dune.env data sets instead of the lichen pastures (you may try). Moreover, you may end up with a different model if you change the modelling strategy. The following simplifies the maximum model:

mback <- step(m1, test="perm")
## Start:  AIC=130
## varespec ~ N + P + K + Ca + Mg + S + Al + Fe + Mn + Zn + Mo + 
##     Baresoil + Humdepth + pH
## 
##            Df AIC    F Pr(>F)  
## - Fe        1 130 0.67  0.760  
## <none>        130              
## - Ca        1 130 0.87  0.540  
## - N         1 130 0.88  0.570  
## - Zn        1 130 0.90  0.550  
## - pH        1 130 0.95  0.505  
## - Al        1 130 0.96  0.560  
## - Mo        1 131 1.09  0.390  
## - Mn        1 131 1.14  0.340  
## - Mg        1 131 1.24  0.235  
## - Baresoil  1 131 1.26  0.260  
## - P         1 132 1.45  0.190  
## - K         1 132 1.46  0.175  
## - S         1 132 1.46  0.155  
## - Humdepth  1 133 1.90  0.055 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=130
## varespec ~ N + P + K + Ca + Mg + S + Al + Mn + Zn + Mo + Baresoil + 
##     Humdepth + pH
## 
##            Df AIC    F Pr(>F)  
## - Al        1 129 0.64  0.835  
## - pH        1 129 0.69  0.725  
## - N         1 130 0.76  0.755  
## - Zn        1 130 0.78  0.685  
## <none>        130              
## - Ca        1 130 0.91  0.490  
## - Mn        1 130 0.98  0.495  
## - Mg        1 131 1.26  0.240  
## - Mo        1 131 1.30  0.265  
## - Baresoil  1 131 1.36  0.230  
## - S         1 131 1.43  0.150  
## - K         1 131 1.59  0.110  
## - P         1 131 1.60  0.145  
## - Humdepth  1 132 1.72  0.075 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=129
## varespec ~ N + P + K + Ca + Mg + S + Mn + Zn + Mo + Baresoil + 
##     Humdepth + pH
## 
##            Df AIC    F Pr(>F)  
## - N         1 129 0.73  0.745  
## <none>        129              
## - Ca        1 129 0.96  0.450  
## - Mn        1 129 1.00  0.480  
## - Zn        1 130 1.06  0.390  
## - pH        1 130 1.06  0.370  
## - Mg        1 130 1.32  0.175  
## - Baresoil  1 130 1.41  0.160  
## - K         1 130 1.58  0.090 .
## - P         1 131 1.62  0.110  
## - Mo        1 131 1.70  0.105  
## - Humdepth  1 131 1.82  0.055 .
## - S         1 131 2.01  0.020 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=129
## varespec ~ P + K + Ca + Mg + S + Mn + Zn + Mo + Baresoil + Humdepth + 
##     pH
## 
##            Df AIC    F Pr(>F)   
## - pH        1 128 0.89  0.505   
## - Ca        1 129 1.03  0.405   
## <none>        129               
## - Zn        1 129 1.06  0.460   
## - Mn        1 129 1.10  0.350   
## - Baresoil  1 130 1.42  0.155   
## - Mg        1 130 1.54  0.120   
## - P         1 130 1.66  0.145   
## - Humdepth  1 130 1.74  0.095 . 
## - Mo        1 130 1.85  0.085 . 
## - K         1 131 2.12  0.040 * 
## - S         1 132 2.68  0.010 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=129
## varespec ~ P + K + Ca + Mg + S + Mn + Zn + Mo + Baresoil + Humdepth
## 
##            Df AIC    F Pr(>F)   
## - Ca        1 128 0.90  0.585   
## - Zn        1 128 1.09  0.370   
## - Mn        1 128 1.10  0.345   
## <none>        128               
## - Mg        1 129 1.51  0.105   
## - Baresoil  1 129 1.62  0.130   
## - P         1 129 1.69  0.100 . 
## - Mo        1 130 1.70  0.065 . 
## - Humdepth  1 130 1.86  0.065 . 
## - K         1 130 2.20  0.010 **
## - S         1 131 2.68  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=128
## varespec ~ P + K + Mg + S + Mn + Zn + Mo + Baresoil + Humdepth
## 
##            Df AIC    F Pr(>F)  
## - Zn        1 128 1.07  0.465  
## - Mg        1 128 1.18  0.330  
## - Mn        1 128 1.20  0.305  
## <none>        128              
## - Baresoil  1 129 1.61  0.130  
## - P         1 129 1.80  0.100 .
## - Humdepth  1 129 1.84  0.045 *
## - Mo        1 129 2.03  0.025 *
## - K         1 129 2.06  0.025 *
## - S         1 130 2.26  0.020 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=128
## varespec ~ P + K + Mg + S + Mn + Mo + Baresoil + Humdepth
## 
##            Df AIC    F Pr(>F)  
## <none>        128              
## - Mn        1 128 1.45  0.155  
## - Baresoil  1 128 1.60  0.100 .
## - P         1 129 1.83  0.065 .
## - Mg        1 129 1.92  0.045 *
## - Humdepth  1 129 1.97  0.030 *
## - K         1 129 2.02  0.050 *
## - S         1 129 2.13  0.020 *
## - Mo        1 129 2.31  0.025 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compare the order in which variables were added in the first model with the order in which they were removed in this one.
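To put the two selections side by side, you can extract the final formulas. A small sketch, assuming m (forward selection) and mback (backward elimination) from the steps above:

```r
# Final models chosen by the two strategies
formula(m)       # forward selection, starting from varespec ~ 1
formula(mback)   # backward elimination, starting from the full model
```

The two strategies need not converge on the same set of constraints, which is one reason automatic selection should not be trusted blindly.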

The constrained ordination methods really do not have an AIC, although one is calculated in step. The function ordistep uses permutation-based significance tests instead of the non-existent AIC. The results of random permutations really are random, and therefore the model choice can have a random component. The usage is almost identical to the AIC-based step:

m <- ordistep(m0, scope = formula(m1))
## 
## Start: varespec ~ 1 
## 
##            Df AIC    F Pr(>F)   
## + Al        1 129 3.67  0.005 **
## + Mn        1 129 3.31  0.005 **
## + Humdepth  1 129 3.01  0.005 **
## + Baresoil  1 130 2.46  0.010 **
## + Fe        1 130 2.44  0.015 * 
## + P         1 130 2.19  0.035 * 
## + Zn        1 130 1.93  0.050 * 
## + Ca        1 130 1.80  0.055 . 
## + Mg        1 130 1.87  0.060 . 
## + K         1 130 1.86  0.060 . 
## + S         1 131 1.51  0.095 . 
## + pH        1 131 1.66  0.120   
## + N         1 131 1.46  0.190   
## + Mo        1 131 1.06  0.390   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: varespec ~ Al 
## 
##      Df AIC    F Pr(>F)   
## - Al  1 130 3.67   0.01 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            Df AIC    F Pr(>F)   
## + P         1 128 2.50  0.010 **
## + K         1 128 2.32  0.025 * 
## + S         1 128 2.16  0.025 * 
## + Zn        1 128 1.99  0.025 * 
## + Mn        1 128 1.89  0.060 . 
## + N         1 129 1.59  0.080 . 
## + Mg        1 129 1.74  0.085 . 
## + Baresoil  1 129 1.57  0.100 . 
## + Ca        1 129 1.42  0.155   
## + Humdepth  1 129 1.38  0.170   
## + Mo        1 130 0.99  0.395   
## + pH        1 130 0.88  0.590   
## + Fe        1 130 0.52  0.910   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: varespec ~ Al + P 
## 
##      Df AIC    F Pr(>F)   
## - P   1 129 2.50  0.015 * 
## - Al  1 130 3.94  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            Df AIC    F Pr(>F)  
## + K         1 127 2.17  0.040 *
## + N         1 128 1.55  0.075 .
## + Baresoil  1 128 1.66  0.100 .
## + S         1 128 1.34  0.185  
## + Mn        1 128 1.26  0.230  
## + Zn        1 128 1.20  0.275  
## + Humdepth  1 129 1.15  0.315  
## + pH        1 129 0.92  0.460  
## + Mg        1 129 0.96  0.485  
## + Mo        1 129 0.98  0.490  
## + Fe        1 129 0.53  0.895  
## + Ca        1 129 0.46  0.915  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: varespec ~ Al + P + K 
## 
##      Df AIC    F Pr(>F)   
## - K   1 128 2.17  0.025 * 
## - P   1 128 2.34  0.005 **
## - Al  1 130 4.33  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            Df AIC    F Pr(>F)
## + Baresoil  1 128 1.45   0.15
## + N         1 128 1.51   0.16
## + Mo        1 128 1.24   0.20
## + Zn        1 128 1.31   0.23
## + S         1 128 1.26   0.27
## + Mg        1 128 1.03   0.40
## + Humdepth  1 128 0.81   0.57
## + Mn        1 128 0.89   0.58
## + Ca        1 129 0.44   0.85
## + pH        1 129 0.51   0.87
## + Fe        1 129 0.52   0.92
m$anova

One problem with model building is that the constraining variables are not independent but correlated. Any correlated variable can be partly explained by the other variables. Such variables are redundant ("expendable") when they occur together with other variables, yet alone they may be the best variables and may prevent other variables from entering the model. A statistic describing this is the variance inflation factor (VIF), which is 1 for a completely independent variable; values above 10 or 20 (depending on your taste) are regarded as highly multicollinear (dependent on others). The VIF of a variable depends on the set of variables it sits with:

vif.cca(m1)
##        N        P        K       Ca       Mg        S       Al       Fe 
##     1.98     6.03    12.01     9.93     9.81    18.38    21.19     9.13 
##       Mn       Zn       Mo Baresoil Humdepth       pH 
##     5.38     7.74     4.32     2.25     6.01     7.39
vif.cca(m)
##   Al    P    K 
## 1.01 2.37 2.38
vif.cca(mback)
##        P        K       Mg        S       Mn       Mo Baresoil Humdepth 
##     3.93     7.63     2.86     9.25     3.47     2.45     1.68     2.84

6.3.2.2 Updating a model

You can build models manually if the automatic procedure fails. For instance, it does not work for the Dutch dune data, where model building stops too early:

m0 <- cca(dune ~  1, dune.env)
m0
## Call: cca(formula = dune ~ 1, data = dune.env)
## 
##               Inertia Rank
## Total            2.12     
## Unconstrained    2.12   19
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for unconstrained axes:
##   CA1   CA2   CA3   CA4   CA5   CA6   CA7   CA8 
## 0.536 0.400 0.260 0.176 0.145 0.108 0.092 0.081 
## (Showed only 8 of all 19 unconstrained eigenvalues)
m1 <- cca(dune ~ ., dune.env)
m1
## Call: cca(formula = dune ~ A1 + Moisture + Management + Use +
## Manure, data = dune.env)
## 
##               Inertia Proportion Rank
## Total           2.115      1.000     
## Constrained     1.503      0.711   12
## Unconstrained   0.612      0.289    7
## Inertia is mean squared contingency coefficient 
## Some constraints were aliased because they were collinear (redundant)
## 
## Eigenvalues for constrained axes:
##  CCA1  CCA2  CCA3  CCA4  CCA5  CCA6  CCA7  CCA8  CCA9 CCA10 CCA11 CCA12 
## 0.467 0.341 0.176 0.153 0.095 0.070 0.059 0.050 0.032 0.026 0.023 0.011 
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7 
## 0.2724 0.1088 0.0897 0.0631 0.0349 0.0253 0.0180
m <- step(m0, scope=formula(m1), test="p")
## Start:  AIC=87.7
## dune ~ 1
## 
##              Df  AIC    F Pr(>F)   
## + Moisture    3 86.6 2.25  0.005 **
## + Management  3 86.9 2.13  0.005 **
## + A1          1 87.4 2.14  0.030 * 
## <none>          87.7               
## + Manure      4 88.8 1.53  0.025 * 
## + Use         2 89.1 1.14  0.190   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=86.6
## dune ~ Moisture
## 
##              Df  AIC    F Pr(>F)   
## <none>          86.6               
## + Management  3 86.8 1.46  0.050 * 
## + A1          1 87.0 1.26  0.175   
## + Use         2 87.3 1.28  0.100 . 
## + Manure      4 87.3 1.31  0.085 . 
## - Moisture    3 87.7 2.25  0.005 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m
## Call: cca(formula = dune ~ Moisture, data = dune.env)
## 
##               Inertia Proportion Rank
## Total           2.115      1.000     
## Constrained     0.628      0.297    3
## Unconstrained   1.487      0.703   16
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for constrained axes:
##  CCA1  CCA2  CCA3 
## 0.419 0.133 0.077 
## 
## Eigenvalues for unconstrained axes:
##   CA1   CA2   CA3   CA4   CA5   CA6   CA7   CA8   CA9  CA10  CA11  CA12 
## 0.410 0.226 0.176 0.123 0.108 0.091 0.086 0.061 0.057 0.047 0.042 0.020 
##  CA13  CA14  CA15  CA16 
## 0.014 0.010 0.009 0.008

You can use the update() command to change the formula. The retained parts of the formula are denoted by a dot (.), and terms are added with + or removed with -. The following keeps the dependent variable (left-hand side) and all previously entered constraints (hence ". ~ ."), and adds Management to the previous model:

m <- update(m, . ~ . + Management)
m
## Call: cca(formula = dune ~ Moisture + Management, data = dune.env)
## 
##               Inertia Proportion Rank
## Total           2.115      1.000     
## Constrained     1.002      0.474    6
## Unconstrained   1.113      0.526   13
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for constrained axes:
##  CCA1  CCA2  CCA3  CCA4  CCA5  CCA6 
## 0.446 0.289 0.112 0.072 0.049 0.034 
## 
## Eigenvalues for unconstrained axes:
##   CA1   CA2   CA3   CA4   CA5   CA6   CA7   CA8   CA9  CA10  CA11  CA12 
## 0.350 0.152 0.125 0.110 0.092 0.077 0.059 0.048 0.037 0.022 0.021 0.011 
##  CA13 
## 0.008

You can check whether any other terms should be added to this model, or whether removing an included term could improve it. Because constraints are correlated, the significance of a term can change when variables are added to or removed from the model:

add1(m, scope=formula(m1), test="perm")
drop1(m, test="perm")

If needed, you can manually continue model building.

6.3.3 Significance tests

We already saw significance tests during model building. These tests are based on permutation: there is no known distribution of inertia that could be used for strict statistical testing. We simply permute the rows of the community data, repeat the analysis, and get a random result. If our observed result is better than most of the random models (say, better than 95% of them), we say that the result is significant. Package vegan has several alternative types of significance tests. They can all be performed with a function called anova. The name is somewhat misleading: the tests are based on permutations, although the layout of the results is similar to a standard ANOVA table. The default is an overall test of all variables together:

anova(m)
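The permutation principle behind these tests can be illustrated in base R. This is a sketch only, using a simple between-group statistic instead of ordination inertia, with simulated data invented for the example:

```r
## Illustration of the permutation-test principle with made-up data.
set.seed(42)
group <- rep(c("A", "B"), each = 10)
y <- c(rnorm(10, mean = 0), rnorm(10, mean = 1))

## Observed statistic: between-group sum of squares
between_ss <- function(y, g) {
  sum(tapply(y, g, function(x) length(x) * (mean(x) - mean(y))^2))
}
obs <- between_ss(y, group)

## Permute the response, recompute the statistic, and count how often
## the permuted value is at least as large as the observed one
nperm <- 999
perm <- replicate(nperm, between_ss(sample(y), group))
p <- (sum(perm >= obs) + 1) / (nperm + 1)
p
```

The same logic applies in anova.cca: rows of the community data are shuffled, the ordination is refitted, and the observed statistic is compared with the permutation distribution.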

We actually used function anova.cca, but you do not need to give its name in full: R automatically chooses the correct anova variant for the result of a constrained ordination. Functions like this, which adapt automatically to the type of their input, are called method functions. We have already seen many of them: for instance, plot and summary work differently for different types of results. The anova.cca function tries to be clever and lazy: it stops automatically if the observed permutation significance clearly differs from the targeted critical value (P = 0.05 by default), but it will continue for a long time in uncertain cases. You must set step and perm.max to the same value to override this behaviour. It is also possible to analyse terms separately:

anova(m, by="term", permu=200)

In this case the function is unable to select the number of iterations automatically. This test is sequential: the terms are analysed in the order in which they appear in the model.
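The dispatch mechanism behind method functions such as anova.cca can be demonstrated with a tiny S3 example in base R (all names here are invented for illustration):

```r
## A generic function and two methods: R picks the method from the
## class attribute of the first argument, just as anova() picks
## anova.cca() for a cca result.
area <- function(shape, ...) UseMethod("area")
area.circle <- function(shape, ...) pi * shape$r^2
area.square <- function(shape, ...) shape$side^2

ci <- structure(list(r = 1), class = "circle")
sq <- structure(list(side = 2), class = "square")

area(ci)  # dispatches to area.circle
area(sq)  # dispatches to area.square
```

This is why `anova(m)` is enough: `class(m)` tells R which variant to call.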

Moreover, it is possible to analyse significance of each axis:

anova(m, by="axis", perm=500)

Now the automatic selection works, but typically some of your axes will be very close to the critical value, and it may be useful to set a lower perm.max than the default 10000. (Typically you would use higher limits than in these examples; we used lower limits to save time when this document is generated automatically.)

6.3.4 Partial ordination

All constrained ordination methods can have terms that are partialled out from the analysis before constraints:

m <- cca(dune ~ Management + Condition(Moisture + A1), data=dune.env)
m
## Call: cca(formula = dune ~ Management + Condition(Moisture + A1),
## data = dune.env)
## 
##               Inertia Proportion Rank
## Total           2.115      1.000     
## Conditional     0.744      0.352    4
## Constrained     0.395      0.187    3
## Unconstrained   0.976      0.461   12
## Inertia is mean squared contingency coefficient 
## 
## Eigenvalues for constrained axes:
##   CCA1   CCA2   CCA3 
## 0.2385 0.0865 0.0704 
## 
## Eigenvalues for unconstrained axes:
##    CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8    CA9   CA10 
## 0.3064 0.1319 0.1152 0.1095 0.0772 0.0758 0.0487 0.0376 0.0311 0.0210 
##   CA11   CA12 
## 0.0125 0.0093

This partials out the effect of Moisture and A1 before analysing the effects of Management. This also influences the significances of the terms:

anova(m, by="term", perm=500)

If we had a designed experiment, we might wish to restrict the permutations so that observations are only permuted within the levels of strata:

with(dune.env, anova(m, by="term", perm=500, strata=Moisture))

Here with() is a special function that makes the variables of dune.env visible to the following command. If you simply type Moisture at the R prompt, you will get an error about a missing object. Functions that take formulae have a data argument giving the data frame in which the variables are found, but other functions usually do not. Instead of with(dune.env, command()), you can first attach(dune.env), after which all variables of the data frame are visible in the session. This can be dangerous if you have similar names in your session and in several attached data frames: it is difficult to know which of them was used.
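The difference between the two approaches can be shown with a toy data frame standing in for dune.env (contents invented for the example):

```r
## Toy stand-in for dune.env
env <- data.frame(Moisture = factor(c(1, 2, 1, 5)),
                  A1 = c(2.8, 3.5, 4.3, 4.2))

## mean(A1)              # would fail: object 'A1' not found
with(env, mean(A1))      # variables in env are visible inside this call only

attach(env)              # now A1 is visible everywhere in the session
mean(A1)                 # works, but riskier: name clashes are easy to miss
detach(env)              # always detach when you are done
```

With with() the scope is limited to a single command, which avoids the name-clash problems of attach().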

6.3.5 Variance partitioning

The basic partial ordination decomposes the inertia into a conditional (partialled-out) component and a (residual) constrained component. The decomposition is sequential: conditions are partialled out before the constraints are analysed, and reversing the order of the terms changes the components. Often we want a neutral decomposition of variation into the unique and shared components of several sources. For this we can use the vegan function varpart, which can handle up to four sources of variation. Function varpart partitions adjusted R2. Unlike the ordinary R2 or variance, adjusted R2 is unbiased, and its expected value is 0 for random data. However, it can easily be calculated only for RDA. In principle CCA could also be used, but the calculations are lengthy and complicated, and no software implements them.
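The adjustment that varpart relies on corrects the ordinary R2 for the number of fitted parameters: with n observations and m predictors, the common (Ezekiel) correction is R2adj = 1 - (1 - R2)(n - 1)/(n - m - 1). A quick base-R check with simulated data (invented for illustration):

```r
## Verify the adjusted-R2 formula against lm() on simulated data
set.seed(1)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
r2  <- summary(fit)$r.squared
m   <- 2                                  # number of predictors
adj <- 1 - (1 - r2) * (n - 1) / (n - m - 1)

all.equal(adj, summary(fit)$adj.r.squared)  # TRUE
```

Because the correction grows with m, adding useless predictors can make adjusted R2 smaller, which is exactly the property varpart exploits.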

The example above can be analysed with the command:

mod <- varpart(dune, ~ Management, ~ A1 + Moisture, data = dune.env)
mod
## 
## Partition of variance in RDA 
## 
## Call: varpart(Y = dune, X = ~Management, ~A1 + Moisture, data =
## dune.env)
## 
## Explanatory tables:
## X1:  ~Management
## X2:  ~A1 + Moisture 
## 
## No. of explanatory tables: 2 
## Total variation (SS): 1598.4 
##             Variance: 84.124 
## No. of observations: 20 
## 
## Partition table:
##                      Df R.squared Adj.R.squared Testable
## [a+b] = X1            3   0.34747       0.22512     TRUE
## [b+c] = X2            4   0.35382       0.18150     TRUE
## [a+b+c] = X1+X2       7   0.58105       0.33666     TRUE
## Individual fractions                                    
## [a] = X1|X2           3                 0.15515     TRUE
## [b]                   0                 0.06997    FALSE
## [c] = X2|X1           4                 0.11153     TRUE
## [d] = Residuals                         0.66334    FALSE
## ---
## Use function 'rda' to test significance of fractions of interest

The first argument (dune) gives the dependent data, and the next two to four arguments give the sources of variation. These can be single variables, matrices, or one-sided model formulae as above.

The output can be tricky to read: the names X1, X2, etc. refer to the original sources. These are divided into unique and shared components. In our example, source X1 (Management) has two components: a unique component a, and a component b shared with source X2. Source X2 consists of its unique component c and the same shared component b. Consequently, X1 = a + b and X2 = b + c. The unique component is the variation that can be explained only by that source and not by the other, and it is shown with the operator | in the output: a = X1 | X2. The lower-case letters can be deciphered using:

showvarparts(2)

showvarparts(3)

and the decomposition can be displayed in the same diagrams with

plot(mod)

The decomposition can be done by hand:

ab <- rda(dune ~ Management, dune.env)
bc <- rda(dune ~ A1 + Moisture, dune.env)
abc <- rda(dune ~ Management + A1 + Moisture, dune.env)
a <- rda(dune ~ Management + Condition(A1 + Moisture), dune.env)
c <- rda(dune ~ A1 + Moisture + Condition(Management), dune.env)

However, the shared component b cannot be found with a direct model. All the models above are testable, but component b is non-testable and can only be found indirectly as a difference between models. The components of variation can be negative for various reasons. The simplest reason is random variation in the adjusted R2, but there can be other, more complicated ones (such as multivariate non-linear dependence between the sources of variation).
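The same indirect arithmetic can be sketched with ordinary regression in base R (variable names invented; with vegan you would take the adjusted R2 of the rda models above, e.g. via RsquareAdj):

```r
## Two-source variance partitioning by hand, using lm() as a stand-in
## for rda(); data simulated so the predictors share a common driver.
set.seed(7)
n  <- 30
z  <- rnorm(n)                    # shared driver
x1 <- z + rnorm(n); x2 <- z + rnorm(n)
y  <- z + rnorm(n)

adjR2 <- function(fit) summary(fit)$adj.r.squared
ab  <- adjR2(lm(y ~ x1))          # [a+b]: all variation tied to X1
bc  <- adjR2(lm(y ~ x2))          # [b+c]: all variation tied to X2
abc <- adjR2(lm(y ~ x1 + x2))     # [a+b+c]: combined model

fa <- abc - bc                    # a: unique to X1
fc <- abc - ab                    # c: unique to X2
fb <- ab + bc - abc               # b: shared, only found as a difference
round(c(a = fa, b = fb, c = fc), 3)
```

Note that b never comes from a fitted model of its own, which is why it is not testable, and why small sampling fluctuations in the three adjusted R2 values can make it negative.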